ABSTRACT



This report describes and evaluates the major computer software packages 
capable of computing standard errors for statistics estimated from complex 
samples. We first describe the problem and the proposed solutions. The 
presently available software is then described in general terms. We then 
present in detail the kinds of statistics available from each program and 
discuss the methods of solution each employs. We then compare the program 
documentation and ease of deck setup. The cost of acquiring and running the 
programs is also discussed briefly. Appendix 1 contains technical papers on 
the programs and Appendix 2 describes the major features and options of one 
program more fully. 
Introduction



Methods of hypothesis testing based on normal statistical theory provide
valid results when subjects are independently sampled from a well-defined 
population. However, the cluster sampling methods routinely employed in 
surveys do not, in general, provide independent observations. This is 
because the probability of inclusion of a particular case is greater given 
the inclusion of another unit from the same cluster. The inappropriate use 
of statistical methods for independent observations will lead to an underes- 
timate of the actual variance. Thus, confidence bounds for statistics will 
be underestimated, and hypotheses may be falsely rejected by statistical 
tests based on such calculations. Conversely, the use of stratification 
variables can reduce the sampling variance for the variables correlated with 
the stratification variables. In such cases, the use of statistics derived 
for simple random samples will overestimate the error variance and create 
the opposite kinds of errors. 

Kish (1965) has developed a statistic which expresses the practical
effect of complex sampling. He defined the Design Effect (DEFF) as the ratio
of the sampling error of an estimator taking into account complex sampling
to the sampling error of the estimator assuming independent observations.
Statistics computed from sample surveys typically have DEFFs greater than one
(Kish & Frankel, 1974). There is a strong consensus among survey researchers
that the problem is important and needs to be given serious consideration 
(Frankel, 1971; Fuller, 1975; Kalton, 1977). 

Based on the large-scale empirical investigation of the effects of 
cluster sampling done by Frankel (1971; see also Kish & Frankel, 1974), we 
can make some general statements concerning its effect on sampling errors.
In general, design effects are larger for simple statistics such as totals 
and ratios than for correlations and regression coefficients. Design effects 
of two or greater are not uncommon. Although design effects vary within a 
survey, the design effects for a given type of statistic (e.g., totals) are 
reasonably consistent within a survey. However, the typical design effect 
for totals may differ appreciably from the typical design effect for regres- 
sion coefficients. 
Use of standard errors computed assuming a simple random sample (SRS)
results in a high Type I error rate; the methods for estimating standard
errors from clustered samples discussed in this report are, in general,
slightly conservative, but the magnitude of the error is small.

We know of four programs capable of computing standard errors for 
statistics computed from complex samples. Two of the programs, SUPER CARP 
and OSIRIS IV, are highly evolved and contain facilities for handling a wide 
range of problems. The two remaining programs are less well developed and 
therefore handle a more limited range of problems. It appears that any 
improvements to these programs would simply result in the duplication of 
features and options presently available in either or both SUPER CARP and 
OSIRIS IV. Thus, it seems that the choice should be between the two most 
completely developed programs. Before discussing SUPER CARP and OSIRIS IV 
in detail, we will mention the two other programs and indicate their (rela- 
tive) shortcomings. 

A program, SURREG, written by Shah et al., uses the Taylor series method
(see below) to compute standard errors for regression coefficients. The
disadvantage of SURREG is that it does not handle simpler statistics such as
totals, means, ratios, proportions, and differences between ratios (e.g.,
between subpopulations). In order to cover these areas, SURREG would have
to be used in combination with either SUPER CARP or OSIRIS IV. But since
SUPER CARP and OSIRIS IV already cover all the cases handled by SURREG, it
would seem to be superfluous. Furthermore, SUPER CARP handles the same
cases using the same method of solution and its coverage of regression is
more complete.

We are also aware of a program developed at WESTAT for computing sampling
errors for totals and ratios. It is written as a SAS procedure, NASSVAR,
and uses balanced repeated replication methods (see below). We understand 
that the program was written to analyze a particular survey, but it is suf- 
ficiently flexible to be of some general interest. At present there is no 
manual, but appropriate documentation could be assembled. We were unable to 
test the program and, of course, could not review the documentation since no 
"official" documentation exists. 
Our understanding is that the program is capable of computing a wide
variety of simple estimates and their sampling errors, such as totals, means,
ratios, and differences between ratios. In its present form, however, it is
much more cumbersome to use because the user must supply a design matrix
which defines the replicates. PROC MATRIX must be called before NASSVAR can
do its calculations. Furthermore, the balanced repeated replication
technique requires that the sample be designed with two primary selections
per stratum. Since no other methods are programmed, this feature would
require the user to adjust the design of some surveys so that it conforms to
this pattern. That is, regardless of the "actual" design, the data input to
the program must have two PS per stratum, and an appropriate design matrix
must exist for the number of strata in the survey.

In addition, in its current version, it cannot be used for doing more 
complicated analyses such as multiple linear regression. This is because 
the statistic to be estimated must be programmed by the user. Thus, it is
really designed to handle univariate statistics only. However, if work on 
this program continues, it could reach the stage of development of SUPER 
CARP or OSIRIS. 



Major Computer Programs for Computing Sampling Errors 

SUPER CARP and OSIRIS IV are the only two programs that compute sam- 
pling errors for both simple and complex statistics. Both programs have 
been released for general use and one, OSIRIS, is a fully supported program 
package. Of the two, SUPER CARP is the more versatile and powerful. It has 
more features and options than OSIRIS IV and also differs in that it uses 
only one method for computing the sampling errors for all statistics. The 
more unified treatment of the problem in SUPER CARP means that it can handle
surveys of any size and design within a single framework. The coverage of 
the OSIRIS IV program is also broad, particularly with regard to the compu- 
tation of standard errors for simple statistics, but for regression problems 
the user has to attend to the method of solution appropriate for that
problem. Its single most attractive method for regression problems, Balanced
Repeated Replications, is not available for "unbalanced" designs or for very
large balanced designs.

OSIRIS IV is a major statistical program package developed specifically 
for the analysis of sample surveys. It features a free format control lan- 
guage, general and flexible data cleaning and file manipulation facilities, 
and numerous data analysis subprograms. The program contains facilities for 
storing variable self-descriptors which name and label the variables, define 
missing values, and name the values of the variable levels. This set of 
descriptors is known as the OSIRIS "dictionary" and is stored separately from
the dataset. To carry out statistical analyses, both files are accessed. 
The package also contains SAS to OSIRIS and SPSS to OSIRIS interfaces, 
&SASFILE and &SPSSFILE. They permit the user to bring data stored in other
systems into OSIRIS IV without having to create the self-described dataset 
manually. If several analyses in OSIRIS are anticipated, &SASFILE or 
&SPSSFILE may be used to create a permanent OSIRIS dataset by simply pro- 
viding the appropriate I/O assignments. The program will then create and 
store the OSIRIS dictionary for later use. Furthermore, both SAS and SPSS 
can read OSIRIS datasets through PROC CONVERT and OSIRIS VARS, respectively.

OSIRIS IV is unique among the widely available packages in that the 
variable self-descriptors can be augmented with codebook records and stored 
online. The codebook may be printed out separately (e.g., for publication) 
or the portion of the codebook describing the variables used in a statisti- 
cal analysis can be listed on the printout along with the results. This 
option is requested by simply setting the keyword PRINT equal to CODEBOOK on 
the parameter card.

A second unique feature of OSIRIS IV is that it contains highly evolved
and well integrated facilities for reading, storing, modifying, and
analyzing hierarchical datasets with variable length records. These
facilities make it possible to bypass complex and difficult data management
steps which
have to be carried out manually on other packages. The OSIRIS IV package is 
described in greater detail in Appendix 2. 

SUPER CARP is a stand-alone program written in FORTRAN G and is based 
on the method of solution given in Fuller (1975). The control language is
fixed format, but since the program is more powerful and versatile than any
other, it is capable of carrying out analyses that would otherwise be
impossible. These special features will be described in the following section.
SUPER CARP is more difficult to use, both because the control language is 
fixed format and because it does not contain interfaces to any package 
programs. 

The two programs, OSIRIS IV and SUPER CARP, use the same method of 
solution for simple statistics (Taylor series expansion) but differ in how 
they handle regression and correlation problems. SUPER CARP invariably
uses Taylor series, while OSIRIS IV uses repeated replication techniques.
In OSIRIS IV, the user has control over the type of repeated replication
technique he or she wishes to use. Balanced repeated replications is a
built-in and preferred option.

In OSIRIS IV the major features of SUPER CARP are split between two 
subprograms, &PSALMS and &REPERR. The methods of solution are described in
Vinter (1980). &REPERR computes sampling errors for means, correlations,
and regression coefficients, while &PSALMS handles totals, ratios, and
differences between ratios. &PSALMS invariably employs TAYLOR; &REPERR has
built-in options for Balanced Repeated Replications (BRR), Jackknife Repeated
Replications (JRR), and JRR with random selection of two primary selections
per stratum. In addition, the user may define the replicates manually.
This requires the preparation of k cards, each defining a replicate.



Major Features and Options of OSIRIS IV and SUPER CARP 

Table 1 summarizes the statistics computed by these two programs; it is
clear that there is considerable overlap in coverage between OSIRIS IV and 
SUPER CARP. However, SUPER CARP includes more complete facilities for 
regression problems. The option for regression with a nested error structure 
is appropriate for so-called "pooled cross-section" designs. They are common 
in econometric research and have also been used by sociologists such as 
Michael Hannon. The two-stage sample option computes the sampling variance
of all requested statistics as the sum of two components. 
Table 1

Coverage of OSIRIS IV and SUPER CARP

Statistic                                        SUPER CARP    OSIRIS IV
                                                               &PSALMS  &REPERR

Totals
Ratios
Means
Difference between two ratios
Subpopulation totals
Subpopulation ratios
Subpopulation means
Subpopulation proportions
Test of independence (contingency table)
Regression equations
Correlations
Standard errors of above statistics
Covariance matrix of parameter estimates
Design effect
Standard errors assuming simple random sample
Coefficient of variation
Bias
Covariance of numerator and denominator
Intraclass correlation
Regression with nested error structure
2-stage samples
Errors-in-variables regression models



X 
X 
X 

X 
X 
X 
X 
X 

X 
X 
X 



X 
X 



X 
X 



X 
X 
X 

X 
X 
X 
X 
X 



X 
X 
X 

X 
X 



X 
X 

X 

X 



Most notably, SUPER CARP contains several so-called "errors-in-variables"
models based on the theoretical work of Fuller (1980; Fuller &
Hidiroglou, 1978). These models assume that the predictor variables are
measured with error; multiple linear regression ordinarily assumes that the
predictors are measured perfectly. To use this feature, either the
covariance matrix of the errors or the reliabilities of the predictor
variables must be known (e.g., from previous research). The user must input
the matrix or a function of the reliabilities. SUPER CARP does not estimate
the reliabilities of the predictors, as does a program such as LISREL. It
does allow the user to provide estimated, rather than known, error variances
for the predictors. In effect, the program disattenuates the regression
equation.

In general, the effect of measurement error is a reduction of the
absolute level of the regression coefficient in relation to its expected value
in the absence of error (Fuller, 1975). This feature of SUPER CARP is also
of obvious value in studies using simple random sampling; it is possible to
use the errors-in-variables option in the program independently of the
options for handling clustered sampling. Thus, SUPER CARP handles two
problems which frequently arise in survey work: cluster sampling and
errors-in-variables.
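
The attenuation described above, and its correction when the error
covariance matrix of the predictors is supplied, can be illustrated with a
minimal sketch. It uses the textbook method-of-moments correction, in which
the known error covariance matrix is subtracted from the observed predictor
covariance matrix before solving the normal equations. The function names
and the numbers are hypothetical, and this is not claimed to be the
algorithm SUPER CARP implements.

```python
import numpy as np


def naive_coefficients(s_xx, s_xy):
    """Ordinary least-squares slopes from observed moment matrices."""
    s_xx = np.asarray(s_xx, dtype=float)
    s_xy = np.asarray(s_xy, dtype=float)
    return np.linalg.solve(s_xx, s_xy)


def disattenuated_coefficients(s_xx, s_xy, s_uu):
    """Slopes after removing the known measurement-error covariance s_uu
    from the observed predictor covariance s_xx (method-of-moments sketch)."""
    s_xx = np.asarray(s_xx, dtype=float)
    s_uu = np.asarray(s_uu, dtype=float)
    s_xy = np.asarray(s_xy, dtype=float)
    return np.linalg.solve(s_xx - s_uu, s_xy)


# Hypothetical one-predictor case: observed variance 5.0, of which 1.0 is
# measurement error, so the reliability of the predictor is 4.0 / 5.0 = 0.8.
observed_var = [[5.0]]
error_var = [[1.0]]
cov_with_outcome = [2.0]

naive = naive_coefficients(observed_var, cov_with_outcome)
corrected = disattenuated_coefficients(observed_var, cov_with_outcome, error_var)
print(naive, corrected)
```

With a single predictor the corrected slope equals the naive slope divided
by the reliability, which is why either the error variances or the
reliabilities are sufficient input.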

OSIRIS IV, on the other hand, computes several useful diagnostic
statistics that are missing from SUPER CARP's output. These include the Design
Effect, the square root of the Design Effect, the bias of a ratio estimate,
the coefficient of variation (if the coefficient of variation of the
denominator is greater than .15, a warning is printed), and the intraclass
correlation. The Design Effect, in particular, is a well-known statistic that
is frequently reported as an indicator of the effect of the cluster sampling.
To obtain the Design Effect using SUPER CARP, one has to rerun the problem
in a standard package program and form the ratio of standard errors manually
from the information in both printouts.
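
The manual calculation described above amounts to a few lines of
arithmetic. The sketch below assumes the common convention in which the
Design Effect is a ratio of variances and its square root is the ratio of
standard errors; the function name and the two standard errors are
hypothetical stand-ins for values read off the two printouts.

```python
def design_effect(se_complex, se_srs):
    """Return (deff, deft) for one estimate.

    se_complex: standard error that accounts for the complex design.
    se_srs:     standard error from a run that assumes simple random sampling.
    """
    if se_complex < 0 or se_srs <= 0:
        raise ValueError("standard errors must be positive")
    deft = se_complex / se_srs   # ratio of standard errors
    deff = deft ** 2             # ratio of variances
    return deff, deft


# Hypothetical values copied from two printouts for the same coefficient.
deff, deft = design_effect(se_complex=0.030, se_srs=0.020)
print(round(deff, 3), round(deft, 3))
```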



Methods of Computing Sampling Errors for Complex Samples 

Taylor Series Expansion 

The first method of estimating sampling errors is the Taylor series
expansion of the first-order statistic (TAYLOR). This method relies on a 
model and makes certain assumptions regarding the adequacy of the model. 
Using this method, a statistic is expressed as a polynomial using the Taylor 
series expansion. The approximate variance of the statistic can then be 
obtained by using only the linear terms of this expansion.* In computing an 
estimate of the variance of the mean, for example, this approach recognizes 
that there is variance associated with both the summed variable as well as 
the count variable. The formulas provide a method of approximating this 
variance estimate. Appendix 1 contains the paper by Professor Fuller which 
gives the expressions used in SUPER CARP. 
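
As an illustration of the linearization idea, the following sketch
estimates the variance of a ratio mean (a weighted sum divided by a
weighted count) from PS-level totals. It uses the standard first-order
linearized variate z = (y - R x) / X and a with-replacement, between-PS
variance within strata. The function name and data layout are assumptions
made for the example; the expressions actually used by SUPER CARP are those
in the paper cited above.

```python
import numpy as np


def taylor_variance_of_ratio(stratum, y, x):
    """Linearized variance of R = sum(y) / sum(x).

    stratum: stratum label for each primary selection (PS).
    y, x:    weighted PS totals of the summed variable and the count variable.
    Returns (ratio, variance_estimate).
    """
    stratum = np.asarray(stratum)
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float)

    ratio = y.sum() / x.sum()
    # First-order (linear) term of the Taylor expansion of y/x, per PS.
    z = (y - ratio * x) / x.sum()

    variance = 0.0
    for h in np.unique(stratum):
        z_h = z[stratum == h]
        n_h = len(z_h)
        if n_h < 2:
            raise ValueError("each stratum needs at least two PS")
        variance += n_h / (n_h - 1) * np.sum((z_h - z_h.mean()) ** 2)
    return ratio, variance


# With a constant count per PS and one stratum, the result reduces to the
# familiar s^2 / n for a simple mean.
ratio, variance = taylor_variance_of_ratio([0, 0, 0, 0], [2, 4, 6, 8], [1, 1, 1, 1])
print(ratio, variance)
```

When the count variable varies across PS, the term `ratio * x` is what
carries the variance of the denominator into the estimate.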

In the past the major difficulty with TAYLOR was that it was difficult 
to apply for more complicated statistics such as correlations and regression 
coefficients. The general formulas are given in Tepping (1968), but these 
expressions are cumbersome for complex statistics. This has limited the use 
of TAYLOR to simple statistics where it performs effectively. We will refer 
to the implementation of TAYLOR using the analytic formulas of Tepping as 
TAYLOR-A. 



* The Taylor series expansion is a classical method for approximating a
function and is used extensively in numerical analysis. The basic idea is that
a function, f, of a variable (e.g., f can be the mean) can be approximated
by the polynomial series

f(x) = f(0) + f'(0)\,x + \frac{f''(0)}{2!}\,x^{2} + \cdots + \frac{f^{(n-1)}(0)}{(n-1)!}\,x^{n-1} + E

where f(0) is the value of the function at zero,
f'(0) is the first derivative of f at zero, etc., and
E is the remainder.

Extensive work with the Taylor series method has indicated that a satis- 
factory approximation to f can frequently be obtained by using only the 
linear terms of the series; that is, terms in the polynomial beyond the 
first derivative are ignored. The sampling variance of a first-order 
statistic, such as a total, is estimated from the sampling variance of the 
first degree terms of the Taylor series expansion. The main assumption of 
the model is that this constitutes an adequate approximation. 
The OSIRIS IV subprogram &PSALMS uses TAYLOR-A to estimate sampling
errors of ratios, ratio means, totals, and differences between ratios (e.g.,
differences between subclasses). The ratios are typically means or
proportions. A second OSIRIS IV subprogram, &REPERR, uses repeated replication
methods to handle means, correlations, and regression coefficients.

Recently, numerical methods of approximating the covariance matrix of 
regression coefficients have been developed (Fuller, 1975; Woodruff & Causey, 
1976). This work makes it considerably easier to implement TAYLOR on the 
computer. Prior to this, TAYLOR had been regarded as a reasonable method 
only for simple linear estimators such as means and totals, since the ana- 
lytic expressions of Tepping posed considerable computing difficulties. But 
the numerical methods make TAYLOR feasible for other statistics. The pro- 
gram SUPER CARP makes use of the simple numerical estimates given by Fuller 
(1975). We will refer to this method as TAYLOR-N.

Repeated Replications 

The second approach is known as "repeated replications." Unlike the 
Taylor series expansion, this method is not model dependent and relies on 
the "brute force" computing power available on modern computers. With this 
approach, the variance of a statistic is estimated using the variability 
among k replicates of the full study. Each replicate is created by
excluding a subsample of primary selections (PS) in the dataset. The idea is for
each replication to reproduce, except for size, the design of the entire
study. The statistic of interest is then estimated for the whole sample and 
for each replication. The variability among these estimates is used to 
estimate the variance of the statistic. Naturally, the precision of this 
estimate increases as more and more replicates are created. For large 
designs with many strata and PS's, the cost of using all possible replicates 
can become excessive. For that reason, several ways of selecting certain 
replicates have been developed. The aim of these strategies is to obtain as 
precise an estimate of the variance as possible with the fewest replicates. 

The best known and most attractive method of selecting replicates is
known as balanced half-samples or Balanced Repeated Replications (Kish &
Frankel, 1970). BRR is actually a strategy for both survey design and
analysis. It requires exactly two PS per stratum. Each replicate
(half-sample) is formed by taking one of the two PS's in each of the H strata.
The replicates are selected in such a way that they are mutually orthogonal.
This pattern of selection is based on sets of orthogonal design matrices
(Hadamard matrices) developed by Plackett and Burman (1946). By choosing
balanced half-samples, the number of replicates needed to gain full
precision can be reduced considerably. With two PS in each of H strata, a total
of 2^H possible replicates exist. However, the maximum number of balanced
half-samples needed is the smallest number equal to or greater than H which
is divisible by four. McCarthy (1966) showed formally that, for linear
estimators, such balanced replications give the same precision as can be
obtained by taking all 2^H replications. This result does not hold exactly
for more complicated estimators (e.g., correlations, regression coefficients,
etc.) but Kish and Frankel (1974) conjecture that it holds approximately.
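
The balancing idea can be sketched for the simplest case, a linear
estimator (a total) with two PS per stratum. The example below builds its
sign patterns with the Sylvester construction, which only yields Hadamard
matrices whose order is a power of two; that is an assumption made to keep
the sketch self-contained and is not the set of Plackett and Burman designs
used by &REPERR. Function names and data are hypothetical.

```python
import numpy as np


def sylvester_hadamard(order):
    """Hadamard matrix of the given order (must be a power of two)."""
    if order < 1 or order & (order - 1):
        raise ValueError("order must be a power of two")
    h = np.array([[1]])
    while h.shape[0] < order:
        h = np.block([[h, h], [h, -h]])
    return h


def brr_variance_of_total(psu_totals):
    """BRR variance of an estimated total.

    psu_totals: array of shape (H, 2), the weighted totals of the two PS
    in each of H strata.
    """
    y = np.asarray(psu_totals, dtype=float)
    n_strata = y.shape[0]

    # Smallest power of two strictly greater than H, so that H mutually
    # orthogonal columns remain after skipping the constant first column.
    order = 1
    while order <= n_strata:
        order *= 2
    signs = sylvester_hadamard(order)[:, 1:n_strata + 1]  # shape (K, H)

    full_total = y.sum()
    # +1 selects the first PS of a stratum, -1 selects the second.
    chosen = np.where(signs > 0, y[:, 0], y[:, 1])
    # Each half-sample is doubled so that it estimates the full total.
    replicate_totals = 2.0 * chosen.sum(axis=1)
    return np.mean((replicate_totals - full_total) ** 2)


psu = [[10, 14], [7, 3], [20, 25], [5, 5], [9, 12]]
print(brr_variance_of_total(psu))
```

For a total, the balanced set reproduces the sum over strata of the squared
difference between the two PS totals, the value that averaging over all 2^H
half-samples would give, here using only 8 replicates for 5 strata.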

The &REPERR program uses the Hadamard matrices in Plackett and Burman. 
This restricts the use of BRR to datasets with between 4 and 88 strata. 
This was the largest Hadamard matrix known in 1946. Since then other methods 
of constructing Hadamard matrices have been developed so that BRR, in theory, 
can be extended to other designs. However, these more recently discovered 
matrices are not built into &REPERR. Currently the only way to use them is 
to read the design matrix in manually. Obviously the program could be 
modified to incorporate these matrices. The following discussion describes 
some of these new applications. 

Because of the attractive properties of orthogonal balance, several 
workers have extended this idea to other designs. Gurney and Jewett (1975) 
have generalized BRR to the case of p PS per stratum, where p is any prime 
number. They give a numerical example that employs 110 strata and three PS 
per stratum. We will refer to this method as Generalized BRR (GBRR). 

Two workers have addressed the problem of reducing the number of
required replications when H is large. Mellor (1973; see also Cochran,
1979) and Lee (1972) have developed a class of so-called Partially Balanced
Repeated Replications (PBRR). PBRR can be applied with two PS per stratum,
or, when combined with the methods of Gurney and Jewett, p PS per stratum.
First, the H strata are divided into g groups with approximately
H/g strata per group. Then a fully balanced set of replications is used
within each group. The resulting complete set of replications is not fully
balanced (across groups) but is balanced within groups. Precision is
highest when g is chosen small (e.g., g=2), but costs increase as g decreases.
Lee (1973) gives some guidelines on the selection of the PBRR designs in
comparison to the corresponding BRR design.
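
The trade-off between cost and balance can be made concrete by counting
replicates. The sketch below applies the rule quoted earlier (the smallest
multiple of four that is at least H) to the full design and, as an
assumption about how PBRR is applied, to the largest of g groups of about
H/g strata. Function names are hypothetical.

```python
import math


def balanced_replicates(n_strata):
    """Smallest multiple of four that is >= n_strata."""
    if n_strata < 1:
        raise ValueError("n_strata must be positive")
    return 4 * math.ceil(n_strata / 4)


def partially_balanced_replicates(n_strata, n_groups):
    """Replicates needed when balance is required only within each of
    n_groups groups of roughly equal size (an assumed reading of PBRR)."""
    if n_groups < 1 or n_groups > n_strata:
        raise ValueError("n_groups must be between 1 and n_strata")
    largest_group = math.ceil(n_strata / n_groups)
    return balanced_replicates(largest_group)


for groups in (1, 2, 5):
    print(groups, partially_balanced_replicates(110, groups))
```

Under these assumptions, more groups means fewer replicates and therefore
lower cost, at the price of losing balance across groups.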

Also, Hadamard matrices larger than order 88 have been discovered.
Currently, Hadamard matrices of order 200+ have been shown to exist, although
there are gaps in the sequence. Thus, the conventional design of two PS per
stratum can be analyzed by BRR when the number of strata is larger than 88.

When it is impossible to apply any of the balanced or partially balanced
methods because the design itself is unbalanced, the Jackknifing method is 
available (JRR). With this method, replicates are formed by omitting one PS 
per replication and reweighting the other PS's in that stratum, until all PS 
have been omitted. Thus, it can be applied when there are different numbers 
of PS within the strata. However, the resulting replicates are not orthogo- 
nal. &REPERR contains built-in options for JRR and for JRR with random 
selection of two PS per stratum, an option intended for use with designs 
with a large number of strata. Using this option, &REPERR can handle 
designs with many more than 88 strata. 
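
The delete-one-PS procedure described above can be sketched as follows.
The estimator is passed in as a function of PS-level weights, so the same
loop serves totals, ratios, or more complicated statistics. The
(n_h - 1)/n_h scaling is the usual stratified jackknife factor; the
function names and data are hypothetical, and this is not claimed to match
&REPERR's internal computations.

```python
import numpy as np


def jrr_variance(stratum, estimator):
    """Stratified jackknife variance.

    stratum:   stratum label for each PS (strata may differ in size).
    estimator: function mapping an array of PS weights to a statistic.
    """
    stratum = np.asarray(stratum)
    base = np.ones(len(stratum))
    theta = estimator(base)

    variance = 0.0
    for h in np.unique(stratum):
        members = np.flatnonzero(stratum == h)
        n_h = len(members)
        if n_h < 2:
            raise ValueError("each stratum needs at least two PS")
        for i in members:
            w = base.copy()
            w[members] = n_h / (n_h - 1)  # reweight the remaining PS
            w[i] = 0.0                    # omit one PS
            variance += (n_h - 1) / n_h * (estimator(w) - theta) ** 2
    return variance


strata = [1, 1, 1, 2, 2]                 # unequal numbers of PS per stratum
y = np.array([4.0, 6.0, 8.0, 10.0, 14.0])
x = np.array([2.0, 2.0, 3.0, 4.0, 5.0])

print(jrr_variance(strata, lambda w: np.sum(w * y)))                  # total
print(jrr_variance(strata, lambda w: np.sum(w * y) / np.sum(w * x)))  # ratio
```

For a total, the result coincides with the textbook stratified variance
formula, which gives a convenient check on the reweighting.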



Comparison of Accuracy of the Methods 

A large scale study of TAYLOR-A, BRR, and JRR was undertaken by Frankel 
(1971; see also Kish & Frankel, 1974). Frankel used an actual dataset (the 
Current Population Survey) and evaluated the methods on both simple linear 
statistics as well as regression coefficients and correlations. He also 
varied the number of strata in the design (6, 12, and 30). The performance
of these methods was evaluated using total error and distribution of t-ratios 
as criteria. All three methods gave good results. TAYLOR-A gave slightly 
lower mean squared error but slightly higher bias, while BRR had the most 
accurate distribution of t-ratios. No method was clearly better or worse 
than the others. 

Woodruff and Causey (1976) compared TAYLOR-N with the three methods 
studied by Frankel. They also used the CPS dataset studied by Frankel.
However, they extended the investigation to include designs with many more 
strata. They reported results for 6, 12, 30, 90, 270, and 810 strata 
designs. These findings are consistent with Frankel's: BRR again was the
best using the t-ratio criterion and Taylor series methods the worst. No
clear differences between TAYLOR-N and TAYLOR-A were found. TAYLOR-A was
very slightly superior to TAYLOR-N for the cases in which both were
computed. However, the performance of TAYLOR-N improved as the number of
strata increased. Using the distribution of t-ratios criterion, the
performance of TAYLOR-N was excellent when the design had 90 or more strata.
This is significant since repeated replication methods become costly for 
large designs. 

The most important message from this work is that all the above methods 
provide good results. It is not reasonable to select or rule out any pro- 
gram solely on the basis of the accuracy of its estimation method. The 
research reviewed here shows that the differences among competing methods
are slight and are dependent on features of the particular design as well as
the criterion used. TAYLOR series methods do worse than BRR when the design has
few strata, but this difference disappears as the number of strata increases. 
Furthermore, TAYLOR-N methods are both available and economically attractive 
when the design is larger than can be analyzed by BRR. Also, the research 
to date has only used a single dataset; characteristics of the dataset such 
as the degree of interdependence among the observations (i.e., the intraclass 
correlation) can be expected to affect the performance of the estimators.
Because these methods are designed to deal with a problem that does not lend
itself to "airtight" analytic investigation, it is not really possible to
make a definitive statement regarding the superiority of one of the methods.

It should also be mentioned that Lee (1973) reported on the performance 
of PBRR in comparison to BRR. He found that the "best" PBRR designs sacri- 
ficed little in precision (about 12%), but an improperly constructed PBRR 
could be extremely inefficient. No comparisons between replication tech- 
niques and TAYLOR were reported. 



Cost 

In repeated replication methods, both cost and accuracy are functions
of the number of PS (i.e., the number of replicates). It is thought that
somewhere between 40 and 100 replicates are needed to achieve acceptable
precision. With large problems the cost can be high. The cost of the TAYLOR
method is relatively independent of the number of PS's. SUPER CARP, in
particular, is quite inexpensive, certainly on par with the cost of a
comparable analysis in an SPSS regression. In this regard, SUPER CARP appears
to be superior to &REPERR.

We were unable to formally test all the programs, mainly because the 
OSIRIS package is presently not available to us. To do a proper evaluation 
of the costs of running each program it is necessary to install both on the
same computer, preferably the computer NCES intends to use. To do this, the
OSIRIS package must be acquired. However, we have set up a computer account
on the Michigan computer system if some testing is called for prior to 
acquisition. 

SAGE owns a copy of SUPER CARP. We tested it on a NAEP dataset
consisting of approximately 1000 cases and 110 strata. A regression analysis
was carried out on both SUPER CARP and SAS using a dozen predictors. The 
cost of the SUPER CARP job was comparable to the SAS run, despite the extra 
computations carried out by SUPER CARP. In this respect, its performance 
was most impressive since a problem with a large number of strata can be 
expensive when carried out using repeated replication techniques. 

Documentation and Ease of Use 

OSIRIS IV has been written to run interactively under MTS (Michigan 
Terminal System), the special operating system at the University of Michigan. 
Thus, considerable effort has already been expended in the area of user
convenience. The package features a free format keyword oriented control
language, many default options, and mnemonic devices. The language is similar
to that of SAS with the following major differences: (1) the "&" is used
instead of the term "PROC", (2) parameters appear on a separate card instead
of on the first card, and (3) the ";" is not used as a delimiter. ("Nothing"
is the default delimiter, but the user can set "something", including the
";", as the delimiter.)

User convenience has not been given any priority in the development of 
SUPER CARP. Its main virtue is that it contains "state of the art" features 
for analyzing complex survey data. Unfortunately, the deck setup is entirely 
fixed format and does not employ default options to handle standard cases. 
The labeling of variables and program output is more limited in SUPER
CARP. Variable names are limited to 8 characters and no value labels are
permitted. The program title is limited to 16 characters. OSIRIS variable
names can be up to 24 characters and value labels up to 8 characters. The
program title can be up to 100 characters and an unlimited number of comment
statements can be used. Furthermore, supplementing the OSIRIS dictionary
with codebook records permits virtually unlimited documentation of the
variables. There is no limit to the number of codebook records per
variable, and provisions have been made for storing a great deal of information
on each variable, such as frequency distributions and miscellaneous remarks
concerning problems with variables (e.g., bad coding, inconsistent coding,
etc.).

The current documentation for the two programs reflects their differing
origins and intended audiences. The SUPER CARP manual is longer and more
complete than the &REPERR and &PSALMS writeups and contains all the
technical details on the methods of solution used in the program. It is nearly
200 pages long and contains several example problems. The examples include
an explanation of the problem, the deck setup, and the output. Instructions
for setting up the control cards are complete and fairly easy to follow, but
since it is a fixed format program with many features and options it is easy
to make errors (e.g., entering a number in the wrong column or inadvertently
requesting an unwanted option). Checking a deck setup must be done with the
manual at hand since no mnemonic devices are used. Clearly, the
documentation is intended for the experienced programmer who is also
knowledgeable in statistics and is not afraid of technical statistical jargon.

The manual is divided up into sections on program input (i.e., deck
setup), program algorithms, miscellaneous uses (e.g., special applications 
such as contingency tables), and examples. The section on program input is 
similar in style to the writeup for the old BMD (fixed format) package. For 
example, the user selects the type of analysis by specifying a number, e.g., 
"1" for regression analysis, "3" for total estimation, etc. 

OSIRIS IV is intended for a larger and less technically sophisticated
audience. The documentation excludes the technical details on the methods
of solution (i.e., formulas), but this information is available elsewhere.
The style of the documentation is closer to that of SAS than to other
packages, but it is of much higher quality. That is, essential details are
not omitted and the package is more logically thought out. Each subprogram
writeup is structured so that it is easy to use the manual as a reference.
The sections within each subprogram writeup are:

(1) General Description - A short paragraph describing the main 
function of the program. 

(2) Special Terminology - a glossary for technical terms used in 
the writeup, e.g., "Design Effect". 

(3) Command Features - describes in detail the features and 
options in the program. 

(4) Special Features (optional) - describes extra features of the
program such as writing residuals to an output file.

(5) Printed Output - describes every statistic appearing in the
output with page estimates.

(6) Input Data - describes the options for the input of data,
e.g., raw data or matrix input.

(7) Restrictions (optional) - limits, if any, to size of problem.

(8) Control Statements - gives the directions for setting up the
control cards.

(9) Examples - example deck setups and explanation of the
problem.

Several example problems are given for each program. The examples
include a brief description of the analysis and the deck setup only (no
output). However, the previous version of OSIRIS provided a separate
volume of the manual that contained sample outputs only. Perhaps OSIRIS IV
will eventually be so documented.

To further demonstrate the control language of these programs, we now
illustrate how each program can be used to do the same calculations. This
is, essentially, the first example in the SUPER CARP manual.

The following SUPER CARP deck setup illustrates the estimation of
totals, ratios, and multiple regression equations. This is followed by an
OSIRIS deck setup which computes the same statistics.
EXAMPLE 1 9 1 4 1 0 41

(FORTRAN FORMAT CARD)
Y X1 X2 X3

0 

.10 .05 

9 

(DATA CARDS) 
3 2 

1 2 

2 2 
1 5 

14 1 1 1 

15 3 4 
2 

3 4 

13 10 
1 2 4 
(ENDFILE) 



The explanation of these control cards is given in the following four
pages, taken from the manual. The control language is typical of
specialized stand-alone statistical programs: a few cards are used to
"set up" the problem (parameter card, format card, variable label card),
and these are followed by an unlimited number of "analysis packets" which
request specific data analyses.






Explanation of SUPER CARP as Taken from the Program Manual

Record 1: PARAMETER CARD

Columns   Coding          Comments
1-16      Example No. 1   Title for output
20-24     9               Number of observations
26        1               Stratum sampling rates will be read in
34-35     4               Number of variables read in
42        1               Program will generate variable 5 as a column
                          of 1's; variable 5 will be named INTERCPT
44        0               Full survey type (stratum, cluster and
                          weight read in)
          4               Number of analyses to be performed
48        1               All data to be listed in output

Record 2: FORMAT CARD

Columns   Coding
1-        (2(I1,1X),5(F2.0,1X))

Record 2a: VARIABLE NAME CARD

Columns   Coding   Comments
1-8       Y        Dependent variable
9-16      X1       Independent variable
17-24     X2       Independent variable
25-32     X3       Independent variable

Record 3: SCREENING CARD

Column    Coding   Comments
2         0        No screening required
Record 4: SAMPLING RATES CARD 1

Columns   Coding   Comments
1-8       0.10     Sampling rate for stratum 1
9-16      0.05     Sampling rate for stratum 2

Record 4: SAMPLING RATES CARD 2

Column    Coding   Comments
8         9        Indicates end of sampling rates

DATA

Data are entered on cards in the specified input format.

Record 5: ANALYSIS CARD 1

Columns   Coding   Comments
4         3        Total estimation
8         2        Number of variables to be included in analysis

Record 6: VARIABLE IDENTIFICATION CARD 1

Columns   Coding   Comments
4         1        Y total to be obtained
8         2        X1 total to be obtained

Record 5: ANALYSIS CARD 2

Columns   Coding   Comments
4         2        Ratio estimation
8         2        Number of variables to be included in analysis

Record 6: VARIABLE IDENTIFICATION CARD 2

Columns   Coding   Comments
4         1        Y is variable in numerator of the ratio
8         5        INTERCPT is variable in denominator of the ratio
Record 5: ANALYSIS CARD 3

Columns   Coding   Comments
4         1        Regression
8         4        Number of variables to be included in analysis
12        1        Weighted least squares
16        1        Intercept indicator is required
28        1        F-test is required

Record 6: VARIABLE IDENTIFICATION CARD 3

Columns   Coding   Comments
4         1        Y is dependent variable
8         5        INTERCPT is always listed after dependent variable
12        3        X2 (other independent variable)
16        4        X3 (other independent variable)

Record 6c: NUMBER OF COEFFICIENTS TO BE TESTED

Columns   Coding   Comments
          2        2 coefficients are to be tested

Record 6d: IDENTIFICATION OF COEFFICIENTS TO BE TESTED

Columns   Coding   Comments
4         3        X2 (variable whose regression coefficient is to be tested)
8         4        X3 (variable whose regression coefficient is to be tested)
Record 5: ANALYSIS CARD 4

Columns   Coding   Comments
4         1        Regression
8         3        Number of variables to be included in analysis
12        1        Weighted least squares
16        0        No intercept in the regression

Record 6: VARIABLE IDENTIFICATION CARD 4

Columns   Coding   Comments
4         1        Y is dependent variable
8         2        X1 (independent variable)
12        4        X3 (independent variable)
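The fixed-column convention described in these card layouts can be sketched in a few lines of Python. This is only an illustration of why a misplaced digit changes the meaning of a card; the helper `make_card` is hypothetical, and it assumes that each coded value is right-justified so that it ends in the stated column, which the manual excerpt does not confirm.

```python
def make_card(fields, width=80):
    """Build one fixed-format card image.

    fields: list of (end_column, value) pairs, columns numbered from 1.
    Assumption (not from the manual): values are right-justified so that
    their last character falls in end_column.
    """
    card = [" "] * width
    for end_col, value in fields:
        text = str(value)
        start = end_col - len(text)
        if start < 0 or end_col > width:
            raise ValueError(f"value {text!r} does not fit ending at column {end_col}")
        card[start:end_col] = text
    return "".join(card).rstrip()


# Analysis card 1 from the example: code 3 (total estimation) in column 4
# and 2 (number of variables) in column 8.
analysis_card_1 = make_card([(4, 3), (8, 2)])
print(repr(analysis_card_1))
```

Shifting either value by one column yields a card that is still syntactically valid but requests something different, which is the kind of error a keyword language tends to catch.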
The OSIRIS deck setup is more typical of a modern program package. A
convention of the OSIRIS package is that variables are identified by
variable number in the deck setup and by name in the output. The OSIRIS
dictionary is included to clarify the meaning of the control cards. Comments
on each control card are contained in brackets.



&DICT
      [Calls the program which creates the self descriptors]
Create self-descriptors for the SUPER CARP example
      [Label card]
PRINT=DICT
      [Parameter card - print input dictionary]
V=1 NAME=STRATUM COL=1
V=2 NAME='PRIMARY SELECTION' COL=3
V=3 NAME='CASE WEIGHT' COL=5
V=4 NAME='DEPENDENT VARIABLE' COL=8 W=2
V=5 NAME='PREDICTOR 1'
V=6 NAME='PREDICTOR 2'
V=7 NAME='PREDICTOR 3'
      [Variable descriptors indicate starting column number and
       variable width. Width defaults to 1 if omitted. COL spec
       defaults to previous spec plus value of WIDTH]
&END
      [End of &DICT specs. Optional in batch mode, needed in
       interactive mode]
&RECODE
      [Invokes OSIRIS recode facility]
R1=1
      [Variable R1 is set equal to constant 1]
&END
&PSALMS
      [Invokes &PSALMS]
ratio and total estimation
      [Title card]
WTVAR=V3 SECU=V2 STRATUM=V1
      [Parameter card identifies first three vars]
STRATA=1,2 SECU=6,3 MODEL=MULT
      [Use strata 1 and 2 containing 6 and 3 SECU's]
NAME='TOTAL Y' PAR=V4/R1 CONSTANT=9
NAME='TOTAL PREDICTOR 1' PAR=V5/R1 CONSTANT=9
NAME='MEAN Y' PAR=V4/R1
      [Each card defines a population parameter to be estimated.
       It is defined using the PAR keyword]
&END
&REPERR
      [Invokes &REPERR]
multiple linear regression using repeated replications
      [Title card]
STRATA=V1 SECU=V2 WTVAR=V3 VARS=V4-V7 PRINT=REP -
STATS=(MEANS,CORR,MULTR)
      [Parameter card. Dash is used to indicate continuation]
STRATUM=1,2 SECU=6,3 MODEL=JKM
      [Defines strata and model to be used]
DEPV=V4 VARS=V5-V7
      [Specifies the vars. in regression]



Aside from the differences caused by the fixed format versus keyword
based control cards, several other features of the programs merit comment.
The underlying structure of SUPER CARP is such that it could be converted to
a very convenient program by replacing the fixed format control language. A
single method is used to handle all types of problems, and it is easy to
request a wide variety of different analyses for the whole sample as well as
any number of subpopulations in the same run. However, the lack of a free
format keyword oriented control language makes access to these features
relatively inconvenient.

OSIRIS &PSALMS appears to have already capitalized on the power of the
Taylor series method as well as the ease of use of a modern keyword language.
This particular example hardly begins to illustrate the range of analyses
that can be conveniently carried out with the program. As with SUPER CARP,
a single run can calculate any number of different estimates for both the
total sample and an unlimited number of subpopulations. In &PSALMS both the
estimates and the subpopulations can be labeled, supplementing the labeling
provided by the dictionary and codebook.
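To make the Taylor series idea concrete, the following Python sketch computes a weighted ratio estimate (such as the mean of Y obtained in the example as V4 divided by a constant variable) together with its linearized variance for a stratified design with primary selections. It is a generic textbook linearization, assuming primaries are treated as sampled with replacement within strata and ignoring finite population corrections; it is not taken from the &PSALMS or SUPER CARP code, and the function name and data layout are hypothetical.

```python
from collections import defaultdict


def ratio_with_taylor_variance(rows):
    """rows: iterable of (stratum, psu, weight, y, x) tuples.

    Returns the ratio R = sum(w*y) / sum(w*x) and its linearized variance.
    """
    rows = list(rows)
    y_hat = sum(w * y for _, _, w, y, _ in rows)
    x_hat = sum(w * x for _, _, w, _, x in rows)
    ratio = y_hat / x_hat

    # Linearized value for each primary: sum of w * (y - R*x) / X_hat.
    z = defaultdict(float)
    for h, i, w, y, x in rows:
        z[(h, i)] += w * (y - ratio * x) / x_hat

    by_stratum = defaultdict(list)
    for (h, _), value in z.items():
        by_stratum[h].append(value)

    variance = 0.0
    for h, values in by_stratum.items():
        n_h = len(values)
        if n_h < 2:
            raise ValueError(f"stratum {h!r} needs at least two primaries")
        mean_h = sum(values) / n_h
        variance += n_h / (n_h - 1) * sum((v - mean_h) ** 2 for v in values)
    return ratio, variance


# One stratum, two primaries, one unit each; x = 1 makes the ratio a mean.
demo = [(1, 1, 1.0, 1.0, 1.0), (1, 2, 1.0, 3.0, 1.0)]
print(ratio_with_taylor_variance(demo))
```

For this small case the result agrees with the usual variance of a mean of two observations, which is a convenient sanity check on the formula.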

The &REPERR program permits the user to request an unlimited number of
regression analyses within a single run. Each regression analysis may be
given its own label (up to 32 characters). However, it does not permit the
user to analyze different subsets of the data within the same run, as do
&PSALMS and SUPER CARP. To accomplish this, the user must invoke the &REPERR
program once for each subset of the data he or she wishes to analyze.

When the BRR method is not being used, &REPERR is slightly less
convenient to set up because the user has to provide the number of primary
selections within each stratum (the BRR option, of course, assumes there are
two). For very large designs this can become tedious. SUPER CARP does not
require the user to furnish this information. This, of course, is still far
more convenient than a repeated replications program such as the WESTAT
program, which, at present, requires the user to input the Plackett-Burman
design matrix for every problem.
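The role of the number of primary selections per stratum can be seen in a minimal Python sketch of jackknife repeated replication. The version below is the standard delete-one-primary form, in which the remaining primaries of the affected stratum are reweighted by n_h/(n_h - 1); it is an assumption that this matches the JKM option of &REPERR, and the function names and data layout are hypothetical.

```python
from collections import defaultdict


def weighted_mean(rows):
    """rows: list of (stratum, psu, weight, y) tuples."""
    return sum(w * y for _, _, w, y in rows) / sum(w for _, _, w, _ in rows)


def jackknife_variance(rows, estimator=weighted_mean):
    """Delete-one-primary jackknife variance of estimator(rows)."""
    rows = list(rows)
    psus = defaultdict(set)
    for h, i, _, _ in rows:
        psus[h].add(i)

    theta = estimator(rows)
    variance = 0.0
    for h, ids in psus.items():
        n_h = len(ids)  # the count the user must supply to a batch program
        if n_h < 2:
            raise ValueError(f"stratum {h!r} needs at least two primaries")
        factor = n_h / (n_h - 1)
        sum_sq = 0.0
        for dropped in ids:
            replicate = []
            for s, p, w, y in rows:
                if s == h and p == dropped:
                    continue  # omit the deleted primary
                if s == h:
                    w = w * factor  # reweight the rest of the stratum
                replicate.append((s, p, w, y))
            sum_sq += (estimator(replicate) - theta) ** 2
        variance += (n_h - 1) / n_h * sum_sq
    return theta, variance


demo = [(1, 1, 1.0, 1.0), (1, 2, 1.0, 3.0)]
print(jackknife_variance(demo))
```

Because every primary is deleted in turn, the number of replicates grows with the size of the design, which is one reason replication methods tend to cost more than a single linearization pass.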

Both OSIRIS programs, of course, permit the user to analyze any subset
of the total number of variables in the dataset. The variables denoting the
strata, primary selections, and case weights can appear anywhere in the file
and are identified on the parameter card. SUPER CARP, at present, requires
these variables to appear first in the dataset and in that order; they are
identified by their location in the file.
Availability



We already have the SUPER CARP program, manual, and relevant literature.
It can easily be installed on any IBM or IBM compatible machine such as
COMNET or NIH. For another AIR project, we installed it on the Stanford IBM
370/3081 without difficulty.

OSIRIS IV is available from the Institute for Social Research. The cost
of the program and terms of the agreement are given in Appendix 2. It also
can be installed on an IBM or IBM compatible machine. OSIRIS IV operates
in batch mode under a conventional operating system such as at NIH or COMNET.

Another option is to use OSIRIS IV at the University of Michigan via 
TELENET. However, this would entail moving from one system to another and 
create logistical problems in obtaining hardcopy output. The advantage is 
that OSIRIS IV runs in a true interactive environment at Michigan, and JCL 
system commands have been replaced with a simple mnemonic keyword language. 
No system commands at all are needed to access disk files, because the 
operating system is capable of reading I/O assignment statements on program 
control cards. Tape files may be read after issuing the $MOUNT command. 
When using the computer interactively, system (MTS) commands may be issued
at any point in the job stream; control will automatically be returned to
MTS and then be returned to the program package when the requested task is
completed.



Possible Modifications 

Of course, the major disadvantages of SUPER CARP are its antiquated
control language and its status as a "stand-alone" program. These problems
would be completely overcome if SUPER CARP were brought into SAS as a
procedure. In the fall (1981) Professor Fuller told us that SAS is, indeed,
planning to incorporate SUPER CARP into their program as a fully supported
subprogram. However, they indicated that the conversion would not take
place until they had hired an additional programmer. We attempted to
contact Professor Fuller in June regarding any progress made by SAS. We were
told that he was out of the country until July, but a graduate student
involved in the development of SUPER CARP told us that he had interviewed
for the SAS position and it was still not filled. Thus, work on converting
SUPER CARP to an SAS procedure apparently has not started.

Converting SUPER CARP to an SAS procedure is a task SAGE can undertake
if NCES indicates a need for its additional features. However, we will want
to discuss this with Professor Fuller upon his return to the U.S. before
taking responsibility for beginning this work.



Summary

The two major programs reviewed here appear to be very sound
alternatives. Both programs compute sampling errors for a wide variety of
statistics, use proven methods of solution, and are adaptable to a wide
variety of sampling designs. However, each program's special strengths lie
in different areas. SUPER CARP's use of the Taylor series expansion means
that it can handle surveys of any size and design without the user's having
to worry about choosing an appropriate method or providing the program with
additional information about the survey design. TAYLOR-N seems to be an
especially sound method for doing multiple linear regression on large
designs, e.g., over 100 strata. If NCES has a special need for regression
analyses on these large designs, SUPER CARP would appear to be the program
of choice. Of course, SUPER CARP also contains facilities for doing
specialized analyses such as errors-in-variables models that are not
available in any other program.

OSIRIS IV, on the other hand, is far stronger in the areas of user
convenience and clarity of the documentation (to the non-sophisticated user).
The &SASFILE and &SPSSFILE commands give users immediate and easy access to
data stored in other systems, and data stored initially in OSIRIS can easily
be analyzed in SAS or SPSS via their interfaces. The keyword oriented
control language makes the program much easier to learn and use. &PSALMS, in
particular, is a very attractive program, because it combines the power and
flexibility of the Taylor series method with the keyword oriented control
language of a modern package program. If NCES's needs lie mainly in the
areas covered by &PSALMS, it would seem to be the program of choice. &REPERR
can be relatively tedious to set up for large unbalanced problems, but the
program is capable of handling a wide variety of designs. The built in
options for Jackknife repeated replications and Jackknife repeated
replications with random selection of two primary selections per stratum
provide for regression analyses of large unbalanced designs; however, these
analyses are likely to be more expensive than the corresponding analyses in
SUPER CARP.

Because OSIRIS is designed specifically to manage and analyze
large-scale surveys, it contains other special features that are especially
valuable to the survey researcher. The package is more fully
self-documenting than any other program reviewed. The dictionary and
codebook records may be stored jointly and accessed by any analysis program
so that each run is fully and permanently documented. Also, the package
offers very efficient ways of storing and managing large files. The portions
of the program dealing with so-called structured files permit the user to
store hierarchical datasets by storing each datapoint only once and without
padding a dataset with missing values to make it rectangular. Furthermore,
the package offers a wide variety of storage modes to maximize efficiency.
For example, questionnaire response data can be stored in half-byte integer
mode (4 bits); the most compact storage mode for numeric data available in
SAS is two-byte (16 bits). For a dataset consisting of hundreds of such
variables, the reduction in storage costs can be considerable.
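The storage arithmetic behind this comparison can be illustrated with a short Python sketch that packs two 4-bit response codes into each byte. It is only a demonstration of the half-byte idea under the assumption that every code lies between 0 and 15; it does not reproduce the OSIRIS file format, and the function names are hypothetical.

```python
def pack_nibbles(values):
    """Pack integers in the range 0..15 two per byte (high nibble first)."""
    packed = bytearray()
    for k in range(0, len(values), 2):
        high = values[k]
        low = values[k + 1] if k + 1 < len(values) else 0  # pad odd length
        if not (0 <= high <= 15 and 0 <= low <= 15):
            raise ValueError("half-byte storage holds only values 0..15")
        packed.append((high << 4) | low)
    return bytes(packed)


def unpack_nibbles(packed, count):
    """Recover the first `count` values stored by pack_nibbles."""
    values = []
    for byte in packed:
        values.append(byte >> 4)
        values.append(byte & 0x0F)
    return values[:count]


responses = [1, 5, 3, 2, 4] * 200          # 1000 questionnaire codes
packed = pack_nibbles(responses)
two_byte_size = 2 * len(responses)          # bytes needed at 16 bits each
print(len(packed), two_byte_size)           # half-byte versus two-byte storage
```

In this illustration the half-byte form needs one quarter of the space of the two-byte form, before any record or dictionary overhead is counted.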

Many of these advantages of OSIRIS would be offset if SUPER CARP were
brought into SAS. In the long run, it appears very likely that this will
happen. In the short run, however, it is unlikely that the staff of SAS
will complete the conversion. It is in this area that SAGE may be able to
help if NCES decides that SUPER CARP best meets their needs.

We emphasize that, since no method of solution has been shown to be
clearly more accurate than another, accuracy should not be a major deciding
factor. In fact, program authors tend to choose one method over another
for reasons of convenience rather than quality. Taylor series methods were
used in SUPER CARP because of their great flexibility: they can be
applied to any size survey no matter how large or unbalanced. OSIRIS uses
Taylor series methods in &PSALMS for largely the same reasons, but used
repeated replications to handle regression problems because those workers
did not have access to the numerical methods developed by Fuller or Woodruff
and Causey. Finally, repeated replication methods were used by WESTAT
because it is a simple "brute force" procedure that can be applied to any
statistic the user may care to estimate. Our judgment is that a choice of
software should be made on the basis of NCES's specific needs (e.g., types
of statistics needed, design of surveys, sophistication of programmers,
etc.) rather than on the basis of any "intrinsic" superiority of one approach
over another.

As a final note, it should be recognized that our testing of the
programs has been necessarily limited due to the short time available to
prepare this report. Prior to making a final selection of programs, some
Monte Carlo simulations to evaluate the accuracy of the methods, or analyses
of real data, could be conducted using all available programs. Such testing
would help determine the costs of using each program as well as the value of
various unique features available in each package.
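A Monte Carlo check of the kind suggested here can be sketched as follows. The Python example repeatedly draws a clustered sample from a simple random-effects model, then compares the empirical variance of the sample mean with the average of the variance estimates that treat the observations as independent. The population model, cluster sizes, and seed are illustrative assumptions and are not drawn from any NCES survey.

```python
import random
import statistics


def simulate(reps=200, clusters=5, size=10, seed=1):
    """Compare the true and naive variances of a mean under clustering."""
    rng = random.Random(seed)
    means, naive = [], []
    for _ in range(reps):
        sample = []
        for _ in range(clusters):
            effect = rng.gauss(0.0, 1.0)            # shared cluster effect
            sample.extend(effect + rng.gauss(0.0, 0.1) for _ in range(size))
        means.append(statistics.mean(sample))
        # Simple random sampling formula: s^2 / n.
        naive.append(statistics.variance(sample) / len(sample))
    empirical = statistics.variance(means)
    return empirical, statistics.mean(naive)


empirical_var, naive_var = simulate()
design_effect = empirical_var / naive_var
print(round(design_effect, 1))
```

With strong within-cluster similarity the ratio is well above one, which is the underestimation of variance that the programs reviewed here are designed to correct.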
Appendix 1

Technical papers on methods of solution in SUPER CARP and OSIRIS 
REGRESSION ANALYSIS FOR SAMPLE SURVEY

By WAYNE A. FULLER¹
Iowa State University

SUMMARY. We investigate the estimation of regression equations for a sample selected
from a finite population. In all derivations, the finite population is treated as a sample from an
infinite population. The regression coefficients are shown to be asymptotically normal, given
mild assumptions. Relatively simple expressions for the covariance matrix of the regression
coefficients are presented.

Procedures for estimating the structural parameters in the presence of response error are
presented. Given knowledge of the response variance, the computations required to estimate the
structural parameters and their standard errors are essentially equivalent to those required for the
computation of the regression coefficient and its standard error in the absence of response error.

1. Introduction
In many scientific investigations, regression equations are computed
for survey data. Examples of such studies include a consumption function
for textiles where consumption is expressed as a function of income and
occupation, a production function where farm output is expressed as a function
of land and other inputs, and a function relating leisure activity to age and
education.

The comparison of domain means is a kind of regression analysis wherein
the independent variables take on only discrete values, such as 0 and 1. The
survey literature, beginning in the 1950's (Yates, 1949), contains considerable
material on the estimation of domain means and their differences. The terms
analytic studies (see Hartley, 1950) and analytic surveys have become part
of the survey statistician's vocabulary. Other than the work on analytic
surveys there is very little literature dealing with the estimation of regression
equations from survey data. One exception is the discussion of Deming (1950),
wherein the problem of identifying the population for which inferences are
desired is discussed.

Konijn (1962) considered the problem of estimating a regression equation
from survey data. Given a population of N clusters of size $M_i$, i = 1, 2, ..., N,
he assumed that the elements in each cluster were a random sample from an
infinite population satisfying the usual linear model, $y = \alpha_i + \beta_i x$, i = 1, 2, ..., N.

¹ [Footnote largely illegible in the source; it refers to the Iowa Agriculture and Home Economics Experiment Station.]
Konijn defined the parameters of interest to be the weighted average
of the N cluster parameters, where the weights were the cluster sizes. Thus,
the slope parameter of interest was $\beta = (\sum M_i)^{-1} \sum M_i \beta_i$. Given a sample
of elements in each of a sample of n clusters, the suggested estimators were
the weighted average of the n sample regression estimators computed for the
n clusters. Under the model, the estimators are unbiased for $\alpha$ and $\beta$. The
variance expression contained a component associated with the variability of
the sample estimates as estimators of the $\beta_i$ and a component arising from the
fact that a sample of the $\beta_i$ is chosen from the finite population of $\beta_i$.

Frankel (1971) studied the empirical behavior of multiple regression
coefficients computed from a cluster sample. The data were a sample of U.S.
households collected by the U.S. Bureau of the Census in the March 1967
Current Population Survey. In Frankel's study, the objective of the
regression analysis was the estimation of the finite population parameters as
defined by the finite population moments. For the data studied by Frankel, the
simple least squares estimators computed from an equal probability cluster
sample displayed little bias, but the usual least squares estimate of variance
underestimated the sampling variance by about 10 percent.

In Frankel's study, variance estimators based on Taylor approximations,
on balanced repeated replication and on jack-knife repeated replications,
all gave reasonable estimates of the variance of the estimated regression
coefficients.

There has been little explicit analytic treatment of the sampling properties
of regression coefficients in the sampling literature. Frankel (1971) used the
Taylor approximations to the variance suggested by Tepping (1968). In the
general form presented by Tepping, these formulas are cumbersome for multiple
regression equations. As we shall see, rather simple representations are
possible for the estimated covariance matrix of the regression coefficients.

It is recognized that much data collected in sample surveys, particularly
that collected from human respondents, are subject to measurement error.
The U.S. Bureau of the Census (1972) has reported estimates of the response
variance, as a percentage of total variance, that range from 0.5 to 40 per cent.
Battese et al. (1972) report response variances of a similar magnitude for items
associated with farm operations.

In a simple regression, the effect of uncorrelated response error in the
independent variable is a reduction in the absolute value of the expected value
of the regression coefficient relative to the expected value in the absence of
response error. If the response errors in the independent and dependent
variables are correlated, the bias is increased or decreased, depending upon
the signs of the error correlation and of the regression coefficient. Cochran
(1968) and Chai (1971) discussed the effect of response variance on regression
statistics. Fuller (1971) has investigated the properties of errors in variables
estimators of regression parameters under the assumption of an infinite model
with normal errors.
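For the simple regression case, the attenuation described above can be written out explicitly. The following sketch assumes a single regressor observed as $X_i = x_i + u_i$, with response error $u_i$ uncorrelated with $x_i$ and with the equation error, and no response error in y; the symbols $\sigma_{xx}$, $\sigma_{uu}$ and $\kappa$ are notation introduced here for illustration and are not taken from the paper.

```latex
\begin{aligned}
\operatorname{plim}\, b
  &= \frac{\operatorname{Cov}(X, y)}{\operatorname{Var}(X)}
   = \frac{\beta\,\sigma_{xx}}{\sigma_{xx} + \sigma_{uu}}
   = \kappa\,\beta,
  \qquad
  \kappa = 1 - \frac{\sigma_{uu}}{\sigma_{xx} + \sigma_{uu}}, \\[4pt]
\frac{\sigma_{uu}}{\sigma_{xx} + \sigma_{uu}} = 0.005
  &\;\Rightarrow\; \kappa = 0.995,
  \qquad
\frac{\sigma_{uu}}{\sigma_{xx} + \sigma_{uu}} = 0.40
  \;\Rightarrow\; \kappa = 0.60 .
\end{aligned}
```

If the reported response variances of 0.5 to 40 per cent are read as shares of the total variance of the observed regressor, the slope would be understated by between roughly 0.5 and 40 per cent in this simple setting.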

We shall consider the estimation of regression equations from samples
selected from finite populations. It shall be assumed that the finite
population is a random sample from an infinite population.

Assuming that the covariance matrix of the response errors is known,
we present an estimator of the regression coefficients under a model that does
not assume identically distributed response errors for each respondent.

2. Limiting distribution of the vector of regression coefficients
To investigate the behaviour of the estimator of the finite population
regression coefficient, it seems necessary to make distributional assumptions
or (and) to use large sample approximations. In this section, we investigate
the limiting behavior of the estimated coefficients as both the sample size
and the population size become large.

In investigating limiting properties of estimators, one must specify a
sequence of finite populations and samples from these populations. One
procedure, and that followed by Hajek (1960) and Madow (1948), is to
treat the sequence of finite populations as a sequence of fixed numbers
possessing certain well defined limiting properties. The required limits are
roughly analogous to the existence of moments. An alternative approach,
and the one we adopt, is to assume that the finite population is a random sample
from a multivariate infinite population with finite fourth moments. It is also
assumed that the covariance matrix of this multivariate population is positive
definite.

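We define the finite population vector of (p+1) regression coefficients by

$$B = Q_N^{-1} H_N \qquad \ldots (1)$$

and the infinite population vector of coefficients by

$$\beta = Q^{-1} H, \qquad \ldots (2)$$

where the rs-th, r, s = 0, 1, 2, ..., p, elements of $Q_N$ and Q are

$$q_{Nrs} = N^{-1} \sum_{i=1}^{N} x_{ir} x_{is} \quad\text{and}\quad q_{rs} = E\{x_{ir} x_{is}\},$$

respectively, the s-th elements of $H_N$ and H are

$$h_{Ns} = N^{-1} \sum_{i=1}^{N} x_{is} y_i \quad\text{and}\quad h_s = E\{x_{is} y_i\},$$

respectively, and $x_{i0} \equiv 1$.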

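The sample estimator of $\beta$, based upon a simple random sample of size
n, is given by

$$b = Q_n^{-1} H_n, \qquad \ldots (3)$$

where the rs-th element of $Q_n$ is

$$q_{nrs} = n^{-1} \sum_{i=1}^{n} x_{ir} x_{is},$$

and the s-th element of $H_n$ is

$$h_{ns} = n^{-1} \sum_{i=1}^{n} x_{is} y_i, \qquad s = 0, 1, 2, \ldots, p.$$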



Theorem 1: Let $\{\xi_n : n = 1, 2, \ldots\}$ be a sequence of finite populations,
where $\xi_n$ is a random sample of size $N_n$, $N_n > N_{n-1}$, selected from a p-dimensional
infinite population. Assume the infinite population possesses finite fourth moments
and a positive definite covariance matrix. Let a simple random nonreplacement
sample of size n be selected from the n-th finite population, n = 1, 2, .... Define
$f_n = n/N_n$ and let

$$\lim_{n \to \infty} f_n = f, \qquad 0 < f < 1.$$

Then

$$n^{1/2}(b - B) \xrightarrow{L} N\bigl(0,\ (1 - f)\,Q^{-1} G Q^{-1}\bigr)$$

as $n \to \infty$, where b is defined in (3), B is defined in (1), the rs-th element of G is

$$G_{rs} = E\{x_{ir} x_{is} e_i^2\},$$

and the population error, $e_i$, is defined by $e_i = y_i - \sum_{r=0}^{p} \beta_r x_{ir}$.


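Proof: We may write‡

$$b - B = Q_n^{-1} R_n - Q_N^{-1} R_N,$$

where $R_n' = n^{-1}\bigl(\sum_{i=1}^{n} e_i,\ \sum_{i=1}^{n} x_{i1} e_i,\ \ldots,\ \sum_{i=1}^{n} x_{ip} e_i\bigr)$
and $R_N$ is defined analogously.

‡ We simplify the notation in subsequent discussion by dropping the subscript n from N.
For the same reason we have not subscripted b and B with n.

Since the elements of $Q_n$ are sample moments with variances of order
$n^{-1}$, we have

$$Q_n - Q = O_p(n^{-1/2}), \qquad Q_N - Q = O_p(n^{-1/2}),$$

and

$$
\begin{aligned}
n^{1/2}(b - B) &= n^{1/2} Q^{-1}(R_n - R_N) + O_p(n^{-1/2}) \\
&= Q^{-1}
\begin{pmatrix}
n^{-1/2}(1 - f_n) \sum_{i=1}^{n} e_i - f_n^{1/2}(1 - f_n)^{1/2}(N - n)^{-1/2} \sum_{i=n+1}^{N} e_i \\
\vdots \\
n^{-1/2}(1 - f_n) \sum_{i=1}^{n} x_{ip} e_i - f_n^{1/2}(1 - f_n)^{1/2}(N - n)^{-1/2} \sum_{i=n+1}^{N} x_{ip} e_i
\end{pmatrix}
+ O_p(n^{-1/2}).
\end{aligned}
$$

Now, $E\{x_{ij} e_i\} = 0$, j = 0, 1, 2, ..., p, the variance of $x_{ij} e_i$ is finite for
j = 0, 1, ..., p, and the vectors $(e_i, x_{i1} e_i, x_{i2} e_i, \ldots, x_{ip} e_i)$, i = 1, 2, ...,
are independently and identically distributed. Therefore, letting
$\lambda = (\lambda_0, \lambda_1, \lambda_2, \ldots, \lambda_p)'$ be a vector of arbitrary real numbers
(not all zero), the linear combination

$$S_{1n} = \sum_{j=0}^{p} \lambda_j\, n^{-1/2}(1 - f_n) \sum_{i=1}^{n} x_{ij} e_i$$

converges in distribution to a normal random variable with mean zero and
variance $(1 - f)^2 \lambda' G \lambda$ by the Lindeberg Central Limit Theorem. In a similar
manner

$$S_{2n} = \sum_{j=0}^{p} \lambda_j\, f_n^{1/2}(1 - f_n)^{1/2}(N - n)^{-1/2} \sum_{i=n+1}^{N} x_{ij} e_i$$

converges in distribution to a normal random variable with mean zero and
variance

$$f(1 - f)\,\lambda' G \lambda.$$

As $S_{1n}$ and $S_{2n}$ are independent, the result follows. □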

It follows from Theorem 1 that $n^{1/2}(b - \beta) \xrightarrow{L} N(0, Q^{-1} G Q^{-1})$. Thus, in
analogy to analytic surveys (see Cochran, 1963, p. 37), one would not use the
finite population correction if one were estimating the infinite population
parameter.
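The step from Theorem 1 to this statement can be sketched as follows. The derivation assumes the conditions of Theorem 1 and writes $R_n = n^{-1}\sum_{i=1}^{n} x_i' e_i$ and $R_N = N^{-1}\sum_{i=1}^{N} x_i' e_i$, with $x_i = (x_{i0}, \ldots, x_{ip})$; the intermediate limits are standard consequences supplied here for illustration rather than statements quoted from the paper.

```latex
\begin{aligned}
n^{1/2}(b - \beta) &= n^{1/2}(b - B) + n^{1/2}(B - \beta), \\
n^{1/2}(B - \beta) &= n^{1/2} Q^{-1} R_N + o_p(1)
  \xrightarrow{L} N\bigl(0,\ f\,Q^{-1} G Q^{-1}\bigr),
  \quad\text{since } n \operatorname{Var}(R_N) = \tfrac{n}{N}\,G \to f\,G, \\
n \operatorname{Cov}(R_n - R_N,\ R_N)
  &= n\Bigl(\tfrac{1}{N}\,G - \tfrac{1}{N}\,G\Bigr) = 0, \\
\text{so the limiting variance of } n^{1/2}(b - \beta) \text{ is }\;
  & \bigl[(1 - f) + f\bigr]\,Q^{-1} G Q^{-1} = Q^{-1} G Q^{-1}.
\end{aligned}
```

The two pieces are asymptotically uncorrelated, so their variances add, and the factor $(1 - f)$ contributed by sampling from the finite population is exactly offset by the variability of the finite population around the infinite one.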
It is of considerable interest that a consistent estimator of the variance
of $n^{1/2}(b - \beta)$ can be constructed by estimating the matrix G by

$$\hat G = n^{-1} \sum_{i=1}^{n} x_i'\, \hat e_i^{\,2}\, x_i, \qquad \ldots (4)$$

where $\hat e_i = y_i - x_i b$ and $x_i = (x_{i0}, \ldots, x_{ip})$ is the i-th row of the matrix
used in constructing $Q_n$.

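Theorem 2: Let the sequence of samples and finite populations satisfy
the assumptions of Theorem 1 and let $\hat G$ be defined in (4). Then plim $\hat G = G$.

Proof: We have

$$
n^{-1} \sum_{i=1}^{n} x_i' \hat e_i^{\,2} x_i
= n^{-1} \sum_{i=1}^{n} x_i' e_i^2 x_i
- 2 n^{-1} \sum_{i=1}^{n} x_i' e_i \bigl[x_i (b - \beta)\bigr] x_i
+ n^{-1} \sum_{i=1}^{n} x_i' \bigl[x_i (b - \beta)\bigr]^2 x_i,
$$

where the rs-th element of the first matrix on the right is given by

$$n^{-1} \sum_{i=1}^{n} x_{ir}\, e_i^2\, x_{is}.$$

By Khinchine's weak law of large numbers (Rao, 1965 p. 92)

$$n^{-1} \sum_{i=1}^{n} x_{ir}\, e_i^2\, x_{is} \xrightarrow{P} E\{x_{ir}\, e_i^2\, x_{is}\}.$$

As $b - \beta = O_p(n^{-1/2})$, the result follows. □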

Although Theorem 2 presents a method of obtaining a consistent
estimator of the variance in the absence of the usual assumptions of the linear
model, it is clear that this estimator will be less efficient than the linear model
estimator if $E\{e_i \mid x_i\} = 0$ and $E\{e_i^2 \mid x_i\} = \sigma^2$ for all $x_i$ in the infinite population.
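The estimator discussed in Theorem 2 can be illustrated for a regression with an intercept and one independent variable. The Python sketch below forms b from the sample moment matrices, estimates G from the squared residuals, and returns the estimated covariance matrix of b as the moment matrix inverse times the estimate of G times the inverse again, divided by n. The divisor n in the estimate of G and the function name are assumptions made for this illustration, and the finite population correction is ignored.

```python
def robust_simple_regression(x, y):
    """Return (b, cov) for y regressed on (1, x) with a residual-based variance."""
    n = len(x)
    # Sample moment matrix Q_n for rows (1, x_i) and moment vector H_n.
    q01 = sum(x) / n
    q11 = sum(v * v for v in x) / n
    h0 = sum(y) / n
    h1 = sum(a * c for a, c in zip(x, y)) / n
    det = q11 - q01 * q01
    q_inv = [[q11 / det, -q01 / det], [-q01 / det, 1.0 / det]]
    b = [q_inv[0][0] * h0 + q_inv[0][1] * h1,
         q_inv[1][0] * h0 + q_inv[1][1] * h1]

    # G_hat: average of x_i' e_i^2 x_i over the sample (divisor n assumed).
    g = [[0.0, 0.0], [0.0, 0.0]]
    for xi, yi in zip(x, y):
        row = (1.0, xi)
        e2 = (yi - b[0] - b[1] * xi) ** 2
        for r in range(2):
            for s in range(2):
                g[r][s] += row[r] * row[s] * e2 / n

    def matmul(a, c):
        return [[sum(a[r][k] * c[k][s] for k in range(2)) for s in range(2)]
                for r in range(2)]

    sandwich = matmul(matmul(q_inv, g), q_inv)
    cov = [[sandwich[r][s] / n for s in range(2)] for r in range(2)]
    return b, cov


b, cov = robust_simple_regression([-1.0, 1.0, -1.0, 1.0], [0.0, 2.0, 2.0, 4.0])
print(b, cov)
```

In this small balanced example the residuals all have the same magnitude, so the result coincides with what the ordinary linear model formula would give when it also uses divisor n.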
By use of the Central Limit Theorem of Appendix A, Theorem 1 can be
extended to the regression coefficient estimated from two-stage stratified
samples.

For a two-stage stratified sample selected from L strata, the rs-th element
of the matrix $Q_n$ is given by

[display not recoverable from the source] ... (5)

where

$W_j$ = fraction of population in stratum j,
$n_j$ = number of primaries selected from stratum j,
$M_{ji}$ = number of elements in primary i of stratum j,
$m_{ji}$ = number of sample elements in primary i of stratum j,
$x_{jitr}$ = the jit-th observation on the r-th independent variable.

Similarly, the s-th element of $H_n$ is

[display not recoverable from the source] ... (6)

The elements of the matrix $n^{-1} \hat G$ are given by the usual formulas for the
estimated covariances of means per primary for stratified two-stage sampling,
using as the variables in the variance formulas

$$d_{jitr} = x_{jitr}\, \hat e_{jit}, \qquad r = 1, 2, \ldots, p,$$

where $\hat e_{jit} = y_{jit} - \sum_{r=0}^{p} b_r x_{jitr}$.

If the finite correction terms are ignored, the rs-th element of $\hat G$ is given by

[display not recoverable from the source]

and

$$n = \sum_{j=1}^{L} n_j.$$
3. Regression estimates in the presence of response error
We present a method for incorporating knowledge of the variance of the
measurement error into the regression analysis.

We first generalize the model associated with equation (1) to include
measurement error. We assume the vector $(y, x_1, x_2, \ldots, x_p)$ to be a random
sample from a multivariate population with finite $4 + \delta$, $\delta > 0$, moments.
As before, we include the unit independent variable in the vector of x's if the
intercept term appears in the model. Thus, for a model with intercept, the
first x is always identically one. The infinite population regression parameters
are defined by equation (2).

If a sample is selected from the population we do not observe y and x,
but observe

$$Y_i = y_i + u_{i0}, \qquad i = 1, 2, \ldots, n,$$
$$X_{ik} = x_{ik} + u_{ik}, \qquad k = 1, 2, \ldots, p,$$

where $u_i = (u_{i0}, u_{i1}, \ldots, u_{ip})$ is the vector of response errors for the i-th
observation. It is assumed that

$$E\{u_i \mid (y_i, x_{i1}, x_{i2}, \ldots, x_{ip})\} = 0$$

for all vectors $(y_i, x_{i1}, x_{i2}, \ldots, x_{ip})$. It is also assumed that $u_i$ is independent
of $u_j$, $i \neq j$. We assume that the $4 + \delta$, $\delta > 0$, moment of $u_i$ is uniformly
bounded. Note that we do not assume that the covariance matrix of u
conditional on $(y_i, x_{i1}, x_{i2}, \ldots, x_{ip})$ is constant for all $(y_i, x_{i1}, x_{i2}, \ldots, x_{ip})$. The
second moment matrix of $(Y, X_1, X_2, \ldots, X_p)$ is given by the sum

[display not recoverable from the source]

where the first matrix is the matrix of second moments of $(y, x_1, x_2, \ldots, x_p)$
and the second matrix is the covariance matrix of u. We assume that $\Sigma_{xx}$
is nonsingular and that the covariance matrix of u is known.

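Theorem 3 : Given the stated assumptions,

    [equation not recoverable from the source]

as $n \to \infty$. Further,

    $(S_{XX} - \Sigma_{uu})^{-1} \hat{A} (S_{XX} - \Sigma_{uu})^{-1}$

is a consistent estimator of the variance of $\hat{\beta}$, where

    $X_{i.} = (X_{i1}, \ldots, X_{ip})$,

    $\hat{v}_i = Y_i - X_{i.} \hat{\beta}$,

    [definitions of $\hat{\beta}$ and $\hat{A}$ not recoverable from the source]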
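Proof: By the moment assumptions

    $S_{XX} = \Sigma_{xx} + \Sigma_{uu} + O_p(n^{-1/2})$,

    $S_{XY} = \Sigma_{xy} + \Sigma_{u0} + O_p(n^{-1/2})$,

and, because $\Sigma_{xx}$ is nonsingular,

    [equation not recoverable from the source]

Therefore

    [equation not recoverable from the source]

Since $S_{Xv} - \Sigma_{uv} = n^{-1} \sum_{i=1}^{n} (X_{i.}' v_i - \Sigma_{uv})$ is a
mean of independent random variables with finite $2+\delta^*$ moments,

    [limiting normal distribution not recoverable from the source]

by the Liapounov form of the multivariate central limit theorem. Now

    $\hat{v}_i = v_i - X_{i.}(\hat{\beta} - \beta) = v_i + O_p(n^{-1/2})$

and

    $X_{i.}' \hat{v}_i \hat{v}_i X_{i.} = X_{i.}' v_i v_i X_{i.} + O_p(n^{-1/2})$.

The elements of the matrix

    $n^{-1} \sum_{i=1}^{n} (X_{i.}' v_i - \Sigma_{uv})(v_i X_{i.} - \Sigma_{vu})$

are the means of independent random variables with finite $1 + \frac{1}{4}\delta$
moments, and it follows by Markov's weak law of large numbers (see Parzen,
1960, p. 418) that

    [equation not recoverable from the source]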
Corollary 3.1 : Given a sequence of finite populations of size $N = N_n
= f^{-1} n$, where $0 < f < 1$, selected as random samples from a population
satisfying the assumptions of Theorem 3, and a sequence of samples of size $n$
selected from these finite populations, then

    [equation not recoverable from the source]

where $\hat{A}$ is defined in Theorem 3 and the remaining terms are defined as
in Theorem 1.

Proof: Following the arguments of Theorems 1 and 3, we have

    [equation not recoverable from the source]

Asymptotic normality follows by arguments analogous to those of Theorems
1 and 3. □
If the covariance matrix of the errors is not known, but is estimated,
the variance of the estimated coefficient vector is increased accordingly.

Corollary 3.2 : Let

    [equation not recoverable from the source]

be an unbiased estimator of the measurement error covariance matrix,
distributed independently of $S_{XX}$ and $S_{XY}$, based on $d$ observations.
Let

    [limiting normal distribution not recoverable from the source]

where the ratio of $d$ to $n$ is a fixed positive number. Then, under the
assumptions of Theorem 3,

    [equation not recoverable from the source]

Proof: The result is immediate since

    $\hat{\beta} - \beta = \Sigma_{xx}^{-1} [(S_{Xv} - \Sigma_{uv}) - (S_{uv} - \Sigma_{uv})] + O_p(n^{-1})$

and $S_{Xv}$ and $S_{uv}$ are independent by assumption. □

As in the case of regression estimation with no response errors, the
procedures generalize to stratified multistage samples. If we assume that the
response errors are independent between secondary units within the same
primary unit as well as between secondary units in different primary units,
the estimation formulas are immediate generalizations of (5), (6), and (7).
The estimator is

    [equation not recoverable from the source]

and is built from the $rs$-th element of the estimated second moment matrix of
$X$ and the $s$-th element of the estimated cross moments of $X$ and $Y$.
The estimated variance of $\hat{\beta}$ is given by

    [equation not recoverable from the source]

where the definition of the $rs$-th element of $\hat{A}$ is likewise not
recoverable from the source.

If one assumes linearity, homogeneity of error variances, and normality,
compact expressions for the covariance matrix of the estimates are possible
(see Fuller, 1971).

To illustrate the computation of the errors-in-variables estimator, we
use Frankel's (1971) data. A one-third systematic sample of the primaries
of the original data was selected, and the log of the income of urban male
heads of households aged 28-58 was regressed on age, age-squared, and years
of education. This gives a sample of 952 primaries with nonzero entries in
30 strata for a total sample of 4020 elements. Estimates are given in Table 1.
The sample was treated as a stratified cluster sample in the computation of
standard errors. In computing the errors-in-variables estimates, it was assumed
that the response variances for age, (age-43)^2 and education were 0.3, 91, and
3.00, respectively. It also was assumed that the response errors in the three
variables were uncorrelated and that the response error in income was
uncorrelated with that in age and education.

For age and age-squared, the response errors will be uncorrelated if the
distribution of the response error in age is symmetric. The response error
in age was estimated from data reported by Bailar (1968) and Palmer (1943).
The response error in education was estimated from U.S. Bureau of the Census
(1972, p. 49).
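
The effect of subtracting a known response variance from the observed second moment can be illustrated with a small simulation. This is a hedged sketch, not the computation used for Table 1: it assumes a single regressor through the origin, simulated data, and a response error in the regressor only, and it uses the scalar analogue of $(S_{XX} - \Sigma_{uu})^{-1} S_{XY}$. All names and parameter values are hypothetical.

```python
import random


def ols_and_corrected_slope(xs_obs, ys_obs, error_var):
    """Slopes through the origin from observed (X, Y) pairs.

    error_var is the known response variance of X, assumed uncorrelated
    with any error in Y. Returns (ordinary slope, corrected slope).
    """
    n = len(xs_obs)
    s_xx = sum(x * x for x in xs_obs) / n
    s_xy = sum(x * y for x, y in zip(xs_obs, ys_obs)) / n
    if s_xx <= error_var:
        raise ValueError("error variance exceeds the observed second moment")
    return s_xy / s_xx, s_xy / (s_xx - error_var)


def simulate(n=20000, beta=0.5, error_var=3.0, seed=1):
    """Generate observed X with response error and Y from the true x."""
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(n):
        x_true = rng.gauss(12.0, 3.0)
        ys.append(beta * x_true + rng.gauss(0.0, 1.0))
        xs.append(x_true + rng.gauss(0.0, error_var ** 0.5))
    return xs, ys


xs, ys = simulate()
ols_slope, corrected_slope = ols_and_corrected_slope(xs, ys, error_var=3.0)
```

In this setup the ordinary slope is attenuated toward zero because the observed second moment of X is inflated by the response variance; removing that variance from the denominator undoes the attenuation on average.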
The estimated standard errors for the least squares regression were
computed by using (7), and the estimated standard errors for the
errors-in-variables regression were computed by using (8).

The adjustment of the covariance matrix for response errors resulted
in a 10 per cent increase in the estimated coefficient for education, an amount
approximately equal to three standard errors. The coefficient for age also
increased considerably. These differences should be interpreted in light of
the fact that age and education are among the items with the smallest response
variances.

TABLE 1. LOG INCOME AS A FUNCTION OF AGE AND EDUCATION,
URBAN MALE HEADS AGED 28-58

                                     age     (age-43)^2   education

least squares regression           .00213     -.00048       .0843
  (standard errors)                .00116      .00013       .0032

errors-in-variables regression     .00286     -.00048       .0946
  (standard errors)                .00117      .00013       .0035



Hidiroglou (1974) has conducted a Monte Carlo study using the Frankel 
data. His results indicate that the large sample approximations were ade- 
quate for samples of 2 primaries per stratum for a population divided into 12 
strata. 
Appendix A

Central limit theorem for two-stage samples

In this Appendix we present a central limit theorem for the sample mean per
primary unit of a two-stage sample from a finite population. The population mean
per primary unit is defined to be the population total divided by the number of
primary units in the population. Let $\{\mathcal{F}_n : n = 1, 2, \ldots\}$ be a
sequence of finite populations, where $\mathcal{F}_n$ contains $N_n$ primary
sampling units, $N_n > N_{n-1}$. The $t$-th observation in the $s$-th primary unit
is denoted by

    $Y_{st} = u_s + e_{st}$,   $s = 1, 2, \ldots, N_n$,   $t = 1, 2, \ldots, M_s$,

where $M_s$ is the number of secondary units in the $s$-th primary, the random
variable $u_s$ is the 'primary component', and the random variable $e_{st}$ is
the 'secondary component'.

For the $s$-th primary it is assumed that the $e_{st}$, $t = 1, 2, \ldots, M_s$,
are a random sample from an infinite population with zero mean, variance
$\sigma_s^2$, and uniformly bounded $2+\delta$ moments, $\delta > 0$. It is assumed
that the vectors $(u_s, M_s, \sigma_s^2)$, $s = 1, 2, \ldots, N_n$, are a random
sample from an infinite trivariate population with finite $4+2\delta$ moments.
Let $\{g_{ns}\}$ be a fixed sequence such that $0 < g_{ns} < 1$, define $m_s$
to be the smallest integer greater than or equal to $g_{ns} M_s$, and set

    [equation not recoverable from the source]

The primary sampling rate $f_n = n/N_n$, $0 < f_n < 1$, is assumed fixed, with the
possible exception associated with the requirement that $n$ be an integer. Let
the sample mean per primary be given by

    [equation not recoverable from the source]

and the population mean per primary by

    [equation not recoverable from the source]

where we have suppressed the subscript on $N$.

Theorem A : Given the stated assumptions,

    [equation not recoverable from the source]

Proof : We write

    [equation not recoverable from the source]

where $f_s = m_s/M_s$ and $\mu_s = M_s u_s - E\{M_s u_s\}$.

Now, the quantities $\mu_s + \sum_{t} e_{st}$, $s = n+1, n+2, \ldots, N$, are
independently distributed with variance
$E\{(M_s u_s)^2 + M_s \sigma_s^2\} - [E\{M_s u_s\}]^2$ and bounded $2+\delta$
moments. Therefore, by the Liapounov Central Limit Theorem,

    [equation not recoverable from the source]

Similarly, the corresponding quantities for $s = 1, 2, \ldots, n$ are
independently distributed with variance

    [equation not recoverable from the source]

and their standardized sum converges in distribution to $N(0, 1)$.

Since the two sums are independent for all $n$, the result follows. □
Survey sampling errors with OSIRIS IV

Vinter, S., Ann Arbor, USA                        Session C1 / third paper

Sampling methods



Summary: The OSIRIS IV software system includes two programs for computing
sampling error estimates derived from surveys with complex sample designs. The
&PSALMS command produces sampling error estimates for ratios and ratio means,
totals and differences of ratios. The &REPERR command produces sampling error
estimates for means and regression statistics based on replication methods of
variance estimation. Three alternative forms of replication methods are
available: simple replication, balanced repeated replication, and jackknife
repeated replication. The features of these two programs are described.

Keywords: Balanced half-sample, complex sample design, jackknife, primary
selection, repeated replication, regression, sampling error, stratification,
subclass, Taylor approximation, variance estimation.

INTRODUCTION 

Complex sample designs, such as stratified multi-stage sampling, are
widely used in survey research. The computation of sampling errors for survey
estimates needs to take into account the type of sample design employed, and
computer programs are required to enable the appropriate computations to be
made. The OSIRIS IV software system, the current version of the OSIRIS system
developed at the Institute for Social Research for the management and analysis
of social science data, includes two programs for estimating sampling errors.
This paper provides an introduction to these two programs, which are denoted
by the &PSALMS and &REPERR commands in the OSIRIS IV system. The following
sections describe the input data requirements, capabilities, and limitations
of these commands, and in doing so detail their computational algorithms,
special implementation features, and the output they produce.

Several techniques have been proposed for estimating sampling errors
based on complex sample designs (see reviews by Kalton, 1977, and Shah, 1977).
One approach approximates the variance of a statistic using the statistic's
linear approximation, obtained by a Taylor series expansion. This approach
is most easily applied for relatively simple estimates and is adopted in the
&PSALMS command to provide sampling error estimates for ratios and ratio
means, differences between such ratios, and totals.

An alternative approach employs some form of the replicated variance
estimation procedure. It is usually more costly to implement because of the
expense of computing a statistic for each replication. However, the attraction
of this approach is the generality of its application; it is easily applied to
a wide range of statistics due to the ease of computing the variance of a
statistic from replications, irrespective of the complexity of the statistic.
The &REPERR command uses the replicated variance estimation procedure to
provide sampling error estimates for statistics produced in multiple
regression analysis. It accommodates three alternative forms of replicated
variance estimation: simple replication, balanced repeated replication (BRR),
and jackknife repeated replication (JRR).

COMPSTAT 1980 © Physica-Verlag, Vienna for IASC (International Association for Statistical Computing), 1980

Kish and Frankel (1974) and Woodruff and Causey (1976) have examined and
compared the approximate variance estimates produced by the Taylor expansion
method, BRR, and JRR. In general, all three methods were found to provide good
variance estimates.

For computational ease and generality, both commands assume that primary
selections are sampled with replacement. In practice, however, they are
usually sampled without replacement. The false use of the 'with replacement'
assumption leads to an overestimation of the variance. But provided the
first stage sampling fraction is small, the overestimation is slight. The
'with replacement' assumption yields a considerable saving in computing costs
because the calculation of variance components from the subsequent stages of
the design is unnecessary.

INPUT DATA REQUIREMENTS 

The &PSALMS and &REPERR commands process OSIRIS datasets sequentially and
use standard OSIRIS features for data retrieval, recoding, and output. A
dataset consists of a dictionary file and a data file. A dictionary file
contains records describing variable attributes, including column location,
storage type, and missing data codes. A data file contains one record per case
with each record comprised of variables in fixed column locations. Most OSIRIS
datasets are termed rectangular, being two dimensional with columns defining
variables and each row representing a case. The commands can process OSIRIS
hierarchical datasets, with data stored in a more complex structure, by
reconfiguring the data into a rectangular form during data retrieval.

Required input specifications for both commands include the definition of
a stratification variable and a sampling error computing unit (SECU) variable.
These variables define the structure of the sample design to the commands.

The stratification variable divides the data into nonoverlapping strata
that generally correspond to the stratification employed in the sample design.
Sampling error estimation requires at least two primary selections within each
stratum; when the sample design does not provide this, the collapsed strata
technique (Cochran, 1977) needs to be applied before the stratification
variable is defined.

The definition of the SECU variable varies with different sample designs
and variance estimation methods. It divides each stratum into unique,
nonoverlapping units, and its formation is fundamental to the estimation
procedure. With multi-stage sampling, each primary selection could constitute
a separate SECU (with single-stage element sampling, each element can be a
SECU). However, if the primary selections are numerous and small, it may be
advantageous to join several selections to form a SECU. Primary selections may
be combined within a stratum when a stratum contains many selections;
alternatively, they may be combined across strata using the 'thickening zone'
technique (Deming, 1960).

Different considerations apply to the two commands in determining
appropriate definitions for the SECUs. The larger the number of SECUs defined,
the greater the precision of the resulting variance estimators. Since the
amount of computing in &PSALMS is relatively independent of the number of
SECUs, generally as many SECUs as possible should be formed (i.e., each
primary selection should be made into a separate SECU). However, the amount of
computation in &REPERR depends on the number of replications formed, and
replication formation depends to a certain extent on the number of SECUs
defined, particularly with the BRR method. Therefore, it is sometimes
desirable to employ a smaller number of SECUs for the &REPERR command.

An important feature in both commands is the capacity to account for
empty SECUs: SECUs that correspond to one or more primary selections in the
sample design but contain no valid cases (due to nonresponse or because the
SECU contains no cases for a subclass during subclass analysis). In order that
empty SECUs are not missed, all SECUs must be numbered consecutively within
each stratum, and the commands' setups must indicate the number of SECUs in
each stratum.

Several features inherent to the OSIRIS IV system increase the
flexibility of the commands. The powerful recoding facility, a large set of
functions for creating and temporarily modifying variables, interfaces
directly with both commands. The stratification and SECU variables may be
created with this facility. Both commands permit data weighting, allowing each
case to be assigned a different weight. A filtering capability is incorporated
in the commands to subset cases efficiently. Since cases are processed
sequentially, they must be sorted by the stratification and SECU variables.
The data can be presorted by the &COPYSORT command, or the sort option can be
selected in the commands to indicate the data are to be sorted before
analysis.

THE &PSALMS COMMAND 

&PSALMS uses the Taylor expansion method to estimate variances for ratios and
ratio means, totals, and differences of ratios. Formulas for each statistic
and its variance are found in Kish (1965). &PSALMS computes sums, sums of
squares, and sums of cross products (SSQCP) for every variable in a ratio
separately for each SECU. Missing data are deleted separately for each
statistic under study. The variance and covariance contributions for every
stratum are computed by accumulating SECU SSQCP within strata according to one
of three models: paired selection, successive difference, and multiple
selection. Since stratum contributions are computed separately, different
models may be employed for different strata in a single run.
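
As an illustration of the Taylor expansion approach for a ratio, the following Python sketch computes r = y/x and a linearized variance from SECU totals. It is a textbook-style sketch, not the &PSALMS code: it assumes with-replacement selection of SECUs, uses the multiple selection form in every stratum, and the function name and input layout are hypothetical.

```python
def ratio_with_taylor_variance(secu_totals):
    """secu_totals: {stratum: [(y_total, x_total), ...]}, one pair per SECU.

    Returns (r, var_r) for r = sum(y) / sum(x). The variance uses the
    linearized variable z = y - r * x and between-SECU variation within
    each stratum.
    """
    y = sum(yt for secus in secu_totals.values() for yt, _ in secus)
    x = sum(xt for secus in secu_totals.values() for _, xt in secus)
    r = y / x

    var_z = 0.0
    for secus in secu_totals.values():
        a = len(secus)
        if a < 2:
            raise ValueError("every stratum needs at least two SECUs")
        z = [yt - r * xt for yt, xt in secus]
        z_bar = sum(z) / a
        var_z += a / (a - 1) * sum((zi - z_bar) ** 2 for zi in z)

    return r, var_z / x ** 2


toy = {
    1: [(10.0, 20.0), (14.0, 22.0)],
    2: [(8.0, 18.0), (9.0, 20.0), (12.0, 20.0)],
}
r, var_r = ratio_with_taylor_variance(toy)
```

Only the per-SECU totals are needed, which is why accumulating sums by SECU in a single sequential pass over sorted data is sufficient.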

The paired selection model is used for strata containing two SECUs. The
successive difference model is available for cases where implicit
stratification is obtained by systematic sampling from an ordered list (Kish,
1965); generally each selection is designated as a separate SECU, and all
selections together form a single stratum. If systematic selection is used
more than once for different parts of the sample, each set of selections
defines a stratum. Since the order of the SECUs in this model reflects the
implicit stratification and hence affects the stratum variance computations,
care must be taken to define SECUs in the order of selections employed in the
systematic selection procedure. The multiple selection model, like the
successive difference model, permits any number of SECUs per stratum. This
model is available as a general model when strata contain a varying number of
SECUs. Every stratum must contain at least two SECUs in all models.
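
The three stratum models can be compared with a short Python sketch. The formulas below are the standard textbook forms for a stratum's contribution to the variance of a total of values z (one value per SECU); they are assumed here to correspond to the three models named above and are not taken from the &PSALMS source. The function names are hypothetical.

```python
def paired_selection(z):
    """Stratum contribution for exactly two SECUs."""
    if len(z) != 2:
        raise ValueError("paired selection needs exactly two SECUs")
    return (z[0] - z[1]) ** 2


def successive_difference(z):
    """Stratum contribution from differences of neighbouring SECUs.

    The order of z must follow the order of systematic selection.
    """
    a = len(z)
    if a < 2:
        raise ValueError("at least two SECUs are required")
    diffs = sum((z[i] - z[i + 1]) ** 2 for i in range(a - 1))
    return a / (2 * (a - 1)) * diffs


def multiple_selection(z):
    """Stratum contribution from deviations about the stratum mean."""
    a = len(z)
    if a < 2:
        raise ValueError("at least two SECUs are required")
    mean = sum(z) / a
    return a / (a - 1) * sum((zi - mean) ** 2 for zi in z)


ordered = [1.0, 2.0, 3.0, 4.0]   # SECU values with a trend along the list
trend_aware = successive_difference(ordered)
trend_blind = multiple_selection(ordered)
```

With two SECUs all three forms coincide. With a trend along the ordered list, the successive difference form gives the smaller value, which reflects the gain from implicit stratification.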

A missing SECU is entered as a zero in stratum contribution computations
and does not increase the case count. Empty strata have no effect on overall
variance estimation and are ignored. Printed diagnostics include SECU and
strata counts and a count and listing of missing SECUs and strata.

A useful facility for subclass designation in &PSALMS enables a single
run to yield sampling errors for a range of estimates for both the total
sample and an unlimited number of subclasses.

The statistic, its standard error and variance, simple random sample
standard error and variance, design effect, intraclass correlation, and case
and weight counts are included in the printout. The adequacy of the Taylor
expansion variance estimate depends on a small coefficient of variation for
the denominator for ratios and ratio means; the &PSALMS printout includes the
coefficient of variation estimate and a warning when it is determined to be
too high. &PSALMS permits the naming and numbering of each statistic under
study for clear printed output, in addition to the inherent system features of
automatic pagination and page titling, dating, and numbering. Additional
printout options include variable sums and sums of squares for each SECU, a
summary and extended table, and contributions to the variance for selected
domains of analysis (sets of strata).
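
The summary quantities listed above are related by simple formulas, sketched here in Python. The sketch assumes Kish's relation deff = 1 + roh (b - 1), with b the average number of cases per SECU; the function names are hypothetical, and no warning threshold for the coefficient of variation is implied because the text does not state the one &PSALMS uses.

```python
import math


def design_summary(var_complex, var_srs, n_cases, n_secus):
    """Return (deff, roh) from the two variance estimates.

    deff is the ratio of the complex-design variance to the simple random
    sample variance; roh solves deff = 1 + roh * (b - 1) for the average
    SECU size b = n_cases / n_secus.
    """
    deff = var_complex / var_srs
    b = n_cases / n_secus
    roh = (deff - 1.0) / (b - 1.0) if b > 1 else float("nan")
    return deff, roh


def denominator_cv(x_total, var_x_total):
    """Coefficient of variation of the denominator of a ratio."""
    return math.sqrt(var_x_total) / abs(x_total)


deff, roh = design_summary(var_complex=3.0, var_srs=2.0, n_cases=100, n_secus=20)
cv = denominator_cv(x_total=100.0, var_x_total=25.0)
```

A design effect above one indicates that clustering has inflated the variance relative to simple random sampling; a value below one indicates that the gains from stratification dominate.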

THE &REPERR COMMAND 

&REPERR computes variances for means, correlations, regression
coefficients, standardized regression coefficients, and multiple correlation
coefficients using replication techniques. Replications, subsets of SECUs that
represent the overall population, are formed using a balanced replication
procedure, a jackknife procedure, or user-specified lists of SECUs. Only
one model may be used in a run.

Processing begins by combining sums, sums of squares, and sums of cross
products of variables for each SECU to form replication totals in accordance
with the selected model. Correlation matrices are computed and standard
regression analyses are performed for the total sample and each replication.
Estimates for each statistic from every replication and the total sample are
combined to produce variance estimates (Frankel, 1971).

The BRR model requires that each stratum contain exactly two SECUs. One
SECU is selected from each stratum to form a replication. This operation is
repeated to form a set of replications that have the property of orthogonality
(Kish and Frankel, 1970). This property is obtained by using the procedure for
producing orthogonal matrices devised by Plackett and Burman (1946). The
number of strata that can be accommodated by this procedure is limited to the
range from 4 to 88. The complements of replications are also included as
replications and used in variance computations, as discussed by Frankel
(1974).
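
A balanced set of half-samples and the resulting variance estimate can be sketched in Python. Note the substitution: this sketch builds the orthogonal matrix by the Sylvester doubling construction, which only yields orders that are powers of two, rather than by the Plackett and Burman procedure that the text says &REPERR uses. The function names are hypothetical, and the statistic is assumed to need no rescaling when computed on a half-sample (a mean or a ratio, for example).

```python
def sylvester_hadamard(order):
    """Matrix of +1/-1 entries with orthogonal rows; order is a power of two."""
    if order < 1 or order & (order - 1):
        raise ValueError("order must be a power of two")
    h = [[1]]
    while len(h) < order:
        h = [row + row for row in h] + [row + [-v for v in row] for row in h]
    return h


def brr_variance(strata, statistic):
    """strata: list of (secu_a, secu_b) pairs, one pair per stratum.

    statistic: function of a list of SECU values. Each row of the matrix
    picks one SECU per stratum; the complement picks the other one.
    """
    n_strata = len(strata)
    order = 1
    while order <= n_strata:  # column 0 (all +1) is skipped, so order > n_strata
        order *= 2
    full = statistic([secu for pair in strata for secu in pair])

    squares, count = 0.0, 0
    for row in sylvester_hadamard(order):
        signs = row[1:n_strata + 1]
        half = [p[0] if s > 0 else p[1] for p, s in zip(strata, signs)]
        comp = [p[1] if s > 0 else p[0] for p, s in zip(strata, signs)]
        for replicate in (half, comp):
            squares += (statistic(replicate) - full) ** 2
            count += 1
    return squares / count


def secu_mean(values):
    return sum(values) / len(values)


pairs = [(1.0, 3.0), (2.0, 6.0), (5.0, 5.0)]
v_brr = brr_variance(pairs, secu_mean)
```

For a statistic that is linear in the SECU values, the balanced set reproduces the paired-difference variance exactly, because the cross-stratum terms cancel over orthogonal columns.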

The jackknife model can be applied with any number of SECUs per stratum
greater than one. A replication is formed by including all SECUs in the sample
but one, and appropriately weighting the remaining SECUs in the stratum from
which the one was deleted. This procedure can be repeated by deleting each
SECU in turn, so that the total number of replications is equal to the number
of SECUs in the sample. If the costs of the analyses are too great using the
total number of replications, a reduced set may be created by randomly
selecting two SECUs in each stratum for deletion, thus forming two replications
per stratum.
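
The delete-one procedure can be sketched in Python as follows. The reweighting factor a/(a-1) for the remaining SECUs of the stratum and the multiplier (a-1)/a on each squared deviation are the usual stratified jackknife choices and are assumed here; the text does not give the exact weights used by &REPERR. Names are hypothetical.

```python
def jrr_variance(strata, statistic):
    """strata: {stratum: [secu_value, ...]}.

    statistic: function of a list of (value, weight) pairs. Every SECU is
    deleted in turn and the rest of its stratum is weighted up by a/(a-1).
    """
    full = statistic([(v, 1.0) for secus in strata.values() for v in secus])
    variance = 0.0
    for h, secus in strata.items():
        a = len(secus)
        if a < 2:
            raise ValueError("each stratum needs more than one SECU")
        others = [(v, 1.0) for k, s in strata.items() if k != h for v in s]
        for drop in range(a):
            kept = [(v, a / (a - 1)) for i, v in enumerate(secus) if i != drop]
            replicate = statistic(others + kept)
            variance += (a - 1) / a * (replicate - full) ** 2
    return variance


def weighted_total(pairs):
    return sum(value * weight for value, weight in pairs)


sample = {1: [1.0, 3.0], 2: [2.0, 6.0, 7.0]}
v_jrr = jrr_variance(sample, weighted_total)
```

For a weighted total the result equals the usual between-SECU variance formula, which gives a convenient check before applying the same routine to nonlinear statistics such as regression coefficients.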

As an alternative to the BRR and JRR methods, the user may define
replications by listing and weighting the SECUs to be included in each
replication. This general model allows flexibility in replication formation
and can be used for simple replicate sampling or methodological comparison of
different replication procedures.

Missing strata are illegal and checked by the command. Missing SECUs are
counted and documented, and processing continues with missing SECUs assumed
empty and not contributing to replication totals.

Standard printout includes the statistic, its standard error, simple
random sample standard error, design effect, and test statistic (corresponding
to the t-statistic under assumptions of simple random sampling and normality).
Print options available include: the listing of SECUs and weights by
replication; SECU and replication univariate statistics, sums, sums of squares
and cross products; and regression analysis by replication and for the total
sample. A feature is available to create dummy variables from categorical
measures for use as independent variables in regression analysis (see Draper
and Smith, 1966, and Kish and Frankel, 1970).
Appendix 2

Description of OSIRIS program package

INSTITUTE FOR SOCIAL RESEARCH

Survey Research Center
Computer Support Group

May 1980
Phone: (313) 764-4417

OSIRIS IV:

Statistical Analysis and Data Management Software System

SYSTEM OVERVIEW

OSIRIS IV is the current version of a software package which has been
evolving for many years at the Institute for Social Research, University of
Michigan. It makes use of the latest practical knowledge to reduce costs and
provide increased capacity and flexibility in the areas of data management and
statistical analysis. OSIRIS IV is designed to serve a broad community of users
and has facilities for handling data collected for a wide range of purposes. In
addition to the usual basic statistics and functions, such as cross-tabulations
and classical regression and correlation analysis, several special techniques
are available for handling nominal- and ordinal-scale data and for calculating
sampling errors for complex designs. OSIRIS IV also has a full range of
well-integrated data management facilities; of special interest are the ability
to handle weighted data, a powerful general purpose recoding facility, matrix
input and output, and hierarchical datasets with variable length records.
Virtually any mode of data can be used directly in OSIRIS IV for up to 32,000
permanent variables. Among its other capabilities are facilities for:

- Online command documentation and indexing

- Interactive setup interpretation

- Storing, retrieving, and modifying information about the structure of a
dataset

- Displaying data

- Editing and correcting data

- Copying and subsetting data

- Transforming data values, through arithmetic and logical operations both
within and across records

- Generating univariate and bivariate frequency distributions and related
statistics

- Producing scatter plots

- Performing multiple regression analysis

- Performing univariate and multivariate analysis of variance

- Conducting analyses with multiple nominal- or ordinal-scale dependent and
independent variables

- Searching among predictors for the greatest variance explanatory power
(AID/SEARCH)

- Factor-analyzing data

- Performing cluster analysis

- Multidimensional scaling

Additional commands and documentation, developed within the Center for
Political Studies to supplement and enhance the data structure capabilities of
OSIRIS IV, include: extended capabilities for adding new and derived measures to
a data structure, and procedures for subsetting the data structure; a capability
for establishing what elements are present in the structure and in what
proportion; and special documentation describing the generation, modification,
and use of hierarchical data structures.

Standard and consistent parameter keywords make OSIRIS IV easy to learn and
use, and minimize the size and complexity of the required documentation. To use
OSIRIS IV, the user supplies the following items, as appropriate:

- OSIRIS IV commands indicating which functions are desired and providing
instructions to the OSIRIS IV monitor

- Data formatted either as an OSIRIS IV dataset or matrix

- Recode statements creating new variables or transforming existing ones

- Entry definitions indicating how groups of variables are to be assembled
from a structured file

- Control statements specifying variables and parameters, optionally defining
a subset of the data to be processed

Every variable in an OSIRIS IV dataset has a number and a fixed set of
attributes associated with it: attributes such as the location of the variable
within each record of the data file, the variable width, type, number of
decimal places, and the values to be treated as missing data. This information
is stored in a dictionary file, one record per variable. Once this information
has been recorded in the dictionary, it need not be respecified by the user.
Variables are then referenced in OSIRIS IV by their associated variable
numbers. Dictionaries may easily be created or revised with the &DICT command.
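
The role of the dictionary can be illustrated with a small Python sketch that decodes a fixed-column record from per-variable attributes. The attribute names and the record layout are hypothetical and only mirror the attributes listed above (location, width, decimal places, missing-data values); this is not the actual OSIRIS IV dictionary format, and the sketch assumes character numeric fields.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VariableEntry:
    number: int       # variable number used to reference the variable
    start: int        # 1-based starting column within the record
    width: int        # field width in columns
    decimals: int     # implied decimal places
    missing: tuple    # values to be treated as missing data


def read_case(record, dictionary):
    """Decode one fixed-column record into {variable number: value}."""
    case = {}
    for var in dictionary:
        raw = record[var.start - 1:var.start - 1 + var.width]
        value = int(raw) / 10 ** var.decimals if var.decimals else int(raw)
        case[var.number] = None if value in var.missing else value
    return case


dictionary = [
    VariableEntry(number=1, start=1, width=3, decimals=0, missing=()),
    VariableEntry(number=2, start=4, width=2, decimals=0, missing=(99,)),
    VariableEntry(number=3, start=6, width=4, decimals=1, missing=()),
]
first = read_case("007450123", dictionary)
second = read_case("008990050", dictionary)
```

Because the layout lives in the dictionary rather than in each analysis setup, a program only needs the variable numbers to retrieve values.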

Variables may be stored in a variety of modes: 

alphabetic 
character numeric 
floating-point binary 
integer binary 
packed decimal 
zoned decimal

The storage mode of the variables need not be of concern except when first
entering data into the OSIRIS IV system, as OSIRIS IV data management commands
can handle data in any of the modes, and analysis commands can process all modes
except alphabetic. OSIRIS IV commands which create new datasets will use the
most efficient mode; however, &TRANS may be used to alter the storage mode of
any or all nonalphabetic variables in a dataset.

It should be noted that OSIRIS III datasets are compatible with OSIRIS IV;
OSIRIS IV can both read and create datasets for use in OSIRIS III.

OSIRIS IV datasets have two possible configurations: 

a. rectangular : all variables for one data case are stored in one record and
each variable occupies the same relative location within each record:

              V1    V2    V3    V4

    CASE 1  |     |     |     |     |

         2  |     |     |     |     |

         3  |     |     |     |     |

         4  |     |     |     |     |
         .  |     |     |     |     |

b. structured : variables are collected into "groups", each with its own
record length:

    GROUP 1  | V1  | V2  | V3  | V4  | V5  |

    GROUP 2  | V6  | V7  | V8  |

    GROUP 3  | V9  | V10 |

    GROUP 4  | V11 | V12 | V13 | V14 | V15 | V16 |

Selected variables from different groups may be joined together to create a
rectangular record for a given run.

The first step in the creation of an OSIRIS IV rectangular dataset is to
build an OSIRIS IV dictionary file, usually via the &DICT command. Once the
dictionary has been created the data may be read directly by OSIRIS IV without
any special "file building."

A structured dataset is built from individual rectangular files via the
&SBUILD command. This type of dataset is used where there is not only a
relationship between variables within a rectangular dataset, but also a
relationship, usually hierarchical, between the various datasets. Each
rectangular dataset becomes one or more groups in the structured dataset. A
simple example is one rectangular dataset containing only household data and
another containing additional data for individual members of each household,
merged by &SBUILD into a single structured dataset.

When a structured dataset is used in OSIRIS IV, the user gives instructions
via the &ENTRY command as to how the groups are to be rearranged to create
temporary rectangular records called "entries." This restructuring of the
groups permits analysis to be performed on a wider range of entries than is
possible with simple rectangular records and can thereby save numerous data
management steps. The dictionary for the structured dataset may contain a
default entry definition which is used to restructure the dataset when no other
instructions are given via the &ENTRY command.
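
The household and person example can be sketched in Python to show what forming rectangular "entries" from groups amounts to. The data layout, variable names, and function name are hypothetical; the sketch does not reproduce &SBUILD or &ENTRY syntax.

```python
def build_entries(households, persons, household_vars, person_vars):
    """Form one rectangular entry per person.

    households: {household id: {variable: value}}
    persons: list of (household id, {variable: value})
    Each entry combines the chosen household variables with the chosen
    person variables.
    """
    entries = []
    for household_id, person in persons:
        parent = households[household_id]
        entry = {name: parent[name] for name in household_vars}
        entry.update({name: person[name] for name in person_vars})
        entries.append(entry)
    return entries


households = {
    1: {"V1": 1, "V2": "urban"},
    2: {"V1": 2, "V2": "rural"},
}
persons = [
    (1, {"V6": 34, "V7": 12}),
    (1, {"V6": 31, "V7": 16}),
    (2, {"V6": 58, "V7": 8}),
]
entries = build_entries(households, persons, ["V1", "V2"], ["V6"])
```

Household values are stored once and repeated only when entries are formed, which is the storage saving that a structured dataset offers over a fully rectangular file.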

The advantage of a structured dataset is that for many data files, a more
efficient storage mode is achieved in terms of space and cost of processing.
This storage technique is more flexible and powerful than other data storage
techniques, and permits larger datasets to be analyzed than in other systems.

OSIRIS IV is an open system; it is relatively easy in most instances to read
data which are stored in character or binary form directly into an OSIRIS IV
command. Another facet of the system's openness is the ability to take any
OSIRIS IV dataset and reformat it for use by other software. Thus, it is
relatively easy to move outside the system; the data are not locked into
OSIRIS IV. Finally, the user may add programs which use OSIRIS IV subroutines
and hence use the common control statement language and OSIRIS IV datasets.
Thus the software may be augmented to meet the user's special needs.

In addition to the basic user manual "OSIRIS IV: Statistical Analysis and
Data Management Software System," there are many related publications of the
Institute for Social Research that can serve as useful supplements. A list of
these is attached.

HARDWARE REQUIREMENTS

The hardware requirements for OSIRIS IV are an IBM 360 or 370 computer, or an
IBM-compatible machine such as an AMDAHL 470 V/6, with at least 150K bytes of
main storage, the equivalent of 1000 to 3000 tracks, 7294 characters each, of
disk work space, and sufficient peripheral devices for user input and output
files. The computer must be operated under MTS, the OS/360 or MVS operating
system, or equivalent.

A periodic newsletter will be sent to all installations to communicate
information about OSIRIS IV and its use. Finally, OSIRIS IV is not a static
system; significant resources are being invested in its improvement and
extension, and updates or new releases will be issued as changes are made.
Commands currently being developed or planned include &AGGREG (aggregation),
&CBKLIST (codebook listing), and &SPSSFILE (input SPSS files).



SYNOPSIS OF COMMANDS IN OSIRIS IV 

Preparing Data for Input

&COPYSORT

Copies, reblocks and/or sorts OSIRIS IV and non-OSIRIS IV datasets.

&DICT

Creates, corrects, modifies, or adds to existing dictionaries, and adds code
category labels to a dictionary.

&MATRIX

&MATRIX is used to enter one or more matrices into OSIRIS IV. Each matrix is
assigned a unique number which is used to reference it in subsequent
commands. Matrices which have been created by OSIRIS IV may simply be entered
following an &MATRIX command. Matrices which have been created by other
systems may also be read by providing the appropriate control statements.

Checking and Correcting Datasets

&CONCHECK

&CONCHECK used in conjunction with &RECODE provides a consistency check
capability to test for illegal relationships between values for groups of
variables. &CONCHECK takes user specifications indicating data inconsistencies
from tests made in &RECODE and displays information allowing the user to
locate each inconsistency. &RECODE and &TRANS or &FCOR can then be used to
correct the inconsistencies.

&FCOR

&FCOR provides file correction capabilities for rectangular and structured
OSIRIS datasets. It corrects values for any of the variables in any data
case, adds a completely new record, or deletes an old one.

&MERCHECK

&MERCHECK detects and corrects merge errors for unit-record datasets (e.g.,
cards) such as missing decks, duplicate decks, or invalid data cards in
datasets. The command produces a file in which each data case has the same
structure: a perfect merge of decks. The data for studies involving one deck
of information per case should also be subjected to &MERCHECK because
&MERCHECK will detect and correct multiple appearances of input data cards
for any given case ID value, and will ensure that no cards foreign to the
study have been included.

&WCC

&WCC verifies whether a set of variables has only legitimate data values and
lists all invalid codes by case ID and variable number. Once the bad code
values have been identified, they may be corrected with &FCOR.
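
The kind of check &WCC is described as performing can be pictured with a
short Python sketch. The table of legal codes, the case records, and the
find_wild_codes function are invented for illustration; they are assumptions
and do not reproduce &WCC control statements or output.

```python
# Hypothetical legal codes, keyed by variable number.
legal_codes = {
    1: {1, 2},           # V1: two valid categories
    2: {1, 2, 3, 4, 9},  # V2: four categories plus 9 for missing data
}

# Hypothetical data cases; integer keys are variable numbers.
cases = [
    {"id": 101, 1: 1, 2: 3},
    {"id": 102, 1: 7, 2: 4},  # V1 = 7 is not a legal code
    {"id": 103, 1: 2, 2: 5},  # V2 = 5 is not a legal code
]


def find_wild_codes(cases, legal_codes):
    """Return (case ID, variable number, value) for every invalid code."""
    bad = []
    for case in cases:
        for var, allowed in legal_codes.items():
            if case[var] not in allowed:
                bad.append((case["id"], var, case[var]))
    return bad


wild = find_wild_codes(cases, legal_codes)
for case_id, var, value in wild:
    print(f"case {case_id}: V{var} has invalid code {value}")
```

The listing by case ID and variable number is what lets the user go back and
correct the offending values in a separate step.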

Displaying Datasets 

&DSLIST 

&DSLIST is used to print a dictionary and/or a subset of variables from an
OSIRIS dataset. &DSLIST is sometimes used to list temporary &RECODE result
variables to check their correctness. A variety of formats is available.

Building and Modifying Structured Datasets 

&ENTRY

With most structured files, the groups of variables created by &SBUILD can be
combined in several ways. Each distinct combination of groups forms an
entry, a single set of variables corresponding to a "case" in a rectangular
file. &ENTRY allows the user to define or redefine the entry to be formed
from the groups in the structured file, allowing the user to specify how the
structured file is to be rectangularized.

&SBUILD

&SBUILD builds an OSIRIS IV structured dataset from one or more rectangular
datasets. The basic unit of a structured dataset is a collection of related
variables called a "group." A group has the same characteristics as a
rectangular dataset: all the records are the same length and each variable is
in the same relative location within each record. However, a structured
dataset may contain many different groups, each with its own set of
variables, and some logical relationship which ties them together.

&UPDATE

&UPDATE builds or updates OSIRIS IV rectangular or structured datasets from
one or more OSIRIS IV rectangular or structured datasets. &UPDATE can add,
delete, or replace cases or variables in a rectangular dataset, and add,
delete, or replace groups, records, or variables in a structured dataset.

Transforming Datasets

&MATRANS

&MATRANS is used to change the type of a matrix, print a matrix, subset a
matrix, and change variable numbers and names in a matrix.

&RECODE

A powerful recoding and variable transformation feature is available with
almost all OSIRIS IV analysis and data management commands. The Recode
facility can create new variables from any arithmetical combination of
existing variables; can bracket or recode variables according to specified
tables; and has several special features such as creating "dummy variables"
and combination variables. In addition, a modest amount of aggregation and
disaggregation may be accomplished via &RECODE.
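
Two of the operations named above, bracketing by a table and creating dummy
variables, can be sketched in Python as follows. The function names, the
cutpoints, and the data are hypothetical; the sketch only illustrates the
transformations and is not the Recode statement language.

```python
import bisect


def bracket(value, upper_bounds):
    """Return the 1-based bracket number for value.

    upper_bounds is an ascending list of inclusive upper limits; values above
    the last limit fall into one additional, open-ended bracket.
    """
    return bisect.bisect_left(upper_bounds, value) + 1


def dummies(value, categories):
    """Return one 0/1 indicator per category."""
    return [1 if value == c else 0 for c in categories]


ages = [17, 25, 40, 64, 65, 80]
# Brackets: 1 = up to 24, 2 = 25 to 44, 3 = 45 to 64, 4 = 65 and over.
age_groups = [bracket(a, [24, 44, 64]) for a in ages]
age_dummies = [dummies(g, [1, 2, 3, 4]) for g in age_groups]

print(age_groups)
print(age_dummies[0])
```

Dummy variables built this way are the form in which a categorical predictor
is usually supplied to a regression command.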

&TRANS 

&TRANS creates a rectangular OSIRIS dataset from specified input variables.
&TRANS can convert the mode of the variables, and can also change the
dictionary type for compatibility with other systems. It also allows the
subsetting of cases. Additionally, &TRANS can be used to insert new variables
created or modified by &RECODE into the dataset, thereby making permanent
copies of them.

Frequency Distributions and Associated Statistical Measures

&SCAT

&SCAT is a bivariate analysis command which produces scatter diagrams,
univariate statistics, and bivariate statistics. The scatter diagrams are
plotted on a rectangular coordinate system; for each combination of
coordinate values that appears in the data, the frequency of its occurrence
is displayed. &SCAT is particularly useful for displaying bivariate
relationships if the numbers of different values for each variable are large
and the number of data cases containing any one value is small. If, however,
a variable assumes relatively few different values in a large number of
cases, &TABLES is more appropriate.

&TABLES

&TABLES produces univariate or bivariate frequency tabulations and
percentages, and univariate statistics by stratum. (For univariate
statistics, see also &USTATS.) &TABLES may also be used to produce quantiles
and several nonparametric measures of association and significance for
ordinal or nominal data: the Mann-Whitney U, the Kruskal-Wallis H, gamma,
Kendall's tau a, b, c, lambda, lambda a, lambda b, Leik-Gove's D for nominal
data (corrected), chi-square, Cramer's V, C-square, Gini coefficient and
Lorenz plot, Goodman and Kruskal's tau a, b, and Cohen's Kappa.

&USTATS

&USTATS computes means, standard deviations, and minimum and maximum values
for a given set of variables. Optionally, it will compute the same
statistics for each variable for each specified subset.

Correlation and Regression Analysis

&MDC

&MDC computes Pearson product-moment correlation coefficients for all pairs
of variables in a list, or for all combinations of variables, one of which is
from one list and another of which is from a second list, or for selected
pairs of variables. &MDC controls for input missing data in one of two
different ways: pair-wise or case-wise.
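
The difference between the two missing-data treatments can be shown with a
small Python sketch. It assumes the usual meanings of the terms: pair-wise
deletion drops a case only when one of the two variables being correlated is
missing, while case-wise deletion drops a case when any variable in the
analysis is missing. The function names and data are hypothetical, and the
code is not taken from &MDC.

```python
from math import sqrt


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def correlation(data, v1, v2, deletion="pairwise"):
    """Correlate v1 and v2; data maps variable name to values (None = missing)."""
    # Case-wise deletion checks every variable; pair-wise checks only v1 and v2.
    checked = list(data) if deletion == "casewise" else [v1, v2]
    keep = [i for i in range(len(data[v1]))
            if all(data[v][i] is not None for v in checked)]
    return pearson([data[v1][i] for i in keep], [data[v2][i] for i in keep])


data = {
    "x": [1, 2, 3, 4, 5],
    "y": [2, 4, 6, 8, 11],
    "z": [5, None, 1, 2, 3],  # the second case is missing on z only
}
r_pair = correlation(data, "x", "y", "pairwise")  # uses all five cases
r_case = correlation(data, "x", "y", "casewise")  # drops the second case
print(r_pair, r_case)
```

Pair-wise deletion uses more of the data for each coefficient, but the
coefficients in one matrix may then rest on different sets of cases.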

&PARTIALS

The n-th order partial-correlation coefficient (partial r) and the
standardized partial regression coefficient (beta) are computed for each pair
in a set of variables, holding all other variables constant. In addition, the
multiple correlation coefficient (R) is computed for each variable using all
the other variables as predictors. &PARTIALS is used in conjunction with
&MATRIX or &MDC.

&REGRESSN

&REGRESSN will compute standard or step-wise multiple regressions with or
without a constant term. It will accept interval or categorical (dummy)
predictors. With the step-wise option, predictors may be forced into the
regression before the step process begins. &REGRESSN will take as input an
OSIRIS IV matrix or dataset. The latter may be weighted or unweighted and
will be subject to a case-wise missing-data deletion. &REGRESSN may be used
to "partial out" a subset of the predictors and print the remaining partial
correlation matrix, prior to running the multiple regression with the full
set of predictors. &REGRESSN will produce &RECODE control statements for
computing residuals, if requested.

Analysis of Variance 

&ANOVA

&ANOVA is a one-way analysis of variance command which performs an unlimited
number of analyses using various independent and dependent variable pairs.
&ANOVA will produce &RECODE control statements for computing residuals, if
requested.

&MANOVA

&MANOVA performs univariate and multivariate analyses of variance and
covariance, using a general linear hypothesis model. Up to twelve factors
(independent variables) can be used. If more than one dependent variable is
specified, both univariate and multivariate analyses are performed. &MANOVA
performs an exact solution with either equal or unequal numbers of
observations in the cells.

Multivariate Analysis Using Ordinal and Nominal Predictors

&DREG

&DREG provides a maximum likelihood regression capability for a dichotomous
dependent variable using either a linear or logit model. &DREG may also be
used to analyze multiway contingency tables whenever one dimension can be
thought of as a dichotomous dependent variable.

&MCA

&MCA examines the relationships between several categorical independent
variables and a single interval-scaled dependent variable, and determines the
effects of each predictor before and after adjustment for its
intercorrelations with other predictors in the analysis. It also provides
information about the bivariate and multivariate relationships between the
predictors and the dependent variable. See Andrews, et al., Multiple
Classification Analysis, for a complete description of the methodology used.
&MCA will produce &RECODE control statements for computing residuals, if
requested.



&MNA

&MNA performs a multivariate analysis of nominal-scale dependent variables.
While the MCA technique described above assumes interval measurement of the
dependent variable and an additive model, &MNA is designed to handle problems
where the dependent variable is a nominal scale, the independent variables
may be measured at any level, including nominal, and where any form or
pattern of relationship may exist between any two variables. The program uses
a series of parallel, dummy variable regressions derived from each of the
dependent variable codes, dichotomized to a 0-1 variable.



&SEARCH

&SEARCH searches among a set of predictor variables for those predictors
which most increase the researcher's ability to account for the variance or
distribution of a dependent variable. The question, "what dichotomous split
on which single predictor variable will give us a maximum improvement in our
ability to predict values of the dependent variable?," embedded in an
iterative scheme, is the basis for the algorithm used in this command.
&SEARCH divides the sample, through a series of binary splits, into a
mutually exclusive series of subgroups. Every observation is a member of
exactly one of these subgroups. They are chosen so that, at each step in the
procedure, the split into the two new subgroups accounts for more of the
variance or distribution (reduces the predictive error more) than a split
into any other pair of subgroups. The predictor variables may be ordinally or
nominally scaled. The dependent variable may be continuous or categorical.
&SEARCH is an elaboration of the OSIRIS III AID3 and THAID programs.
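
A single step of the search described above can be sketched in Python. The
sketch assumes a continuous dependent variable, treats every predictor as
ordinal (splits of the form "predictor <= cutpoint"), and uses the
between-group sum of squares as the criterion. These are simplifying
assumptions made for illustration; the data, names, and functions are
hypothetical and are not the &SEARCH implementation.

```python
def between_ss(left, right):
    """Between-group sum of squares for a two-way split of the dependent variable."""
    n1, n2 = len(left), len(right)
    m1, m2 = sum(left) / n1, sum(right) / n2
    grand = (sum(left) + sum(right)) / (n1 + n2)
    return n1 * (m1 - grand) ** 2 + n2 * (m2 - grand) ** 2


def best_split(rows, predictors, dependent):
    """Return (between SS, predictor, cutpoint) for the best dichotomous split."""
    best = None
    for p in predictors:
        values = sorted({r[p] for r in rows})
        for cut in values[:-1]:  # every cutpoint that leaves both sides non-empty
            left = [r[dependent] for r in rows if r[p] <= cut]
            right = [r[dependent] for r in rows if r[p] > cut]
            candidate = (between_ss(left, right), p, cut)
            if best is None or candidate[0] > best[0]:
                best = candidate
    return best


rows = [
    {"educ": 1, "region": 1, "income": 10},
    {"educ": 1, "region": 2, "income": 12},
    {"educ": 2, "region": 1, "income": 11},
    {"educ": 2, "region": 2, "income": 13},
    {"educ": 3, "region": 1, "income": 30},
    {"educ": 3, "region": 2, "income": 32},
]
split = best_split(rows, ["educ", "region"], "income")
print(split)
```

The full procedure would apply the same step again within each of the two
resulting subgroups until some stopping rule is met. A nominal predictor
would need its categories ordered or grouped before cutpoints are tried.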

Factor Analysis and Multidimensional Scaling 



&COMPARE

&COMPARE is based on Schonemann and Carroll's procedure for "fitting one
matrix to another under choice of a central dilation and rigid motion." The
technique rotates one configuration (the problem space) to the space of the
other configuration (the target space) to achieve a least-squares fit. In
seeking the best fit, the rotation is a "rigid motion," which maintains the
orthogonality of the axes. A typical application is to compare the
configurations produced by non-metric scaling analysis and factor analysis
from the same data.



&FACTAN

&FACTAN provides a general factor analysis package that includes numerous
options for the application of various factor analytic tools currently in
use. Separate factor analyses may be performed on various subsets of
variables in a single run.



&MINISSA

&MINISSA (Michigan-Israel-Netherlands Integrated Smallest Space Analysis) is
a nonmetric multidimensional scaling command. The input to &MINISSA is a
matrix of similarity or dissimilarity coefficients (e.g., Pearson's r); the
output is a geometric representation of the matrix in m dimensions. &MINISSA
constructs a configuration of points in space using information about the
order relations among the coefficients. Because it is usually possible to
satisfy the order relations of the coefficients in fewer dimensions than
would be necessary to reproduce the metric information, the technique is
called smallest space analysis (SSA).

Cluster Analysis 

&CLUSTER

&CLUSTER performs hierarchical cluster analysis. With input consisting of a
symmetrical matrix of measures of similarities or dissimilarities, &CLUSTER
successively partitions the dataset into a set of clusters as determined by a
clustering criterion. Clustering methods include the minimum and maximum
methods, the central vectors and coefficient alpha method for similarities,
and the centroid distance and mean square error methods for dissimilarities.

Sampling Error Analysis 

&PSALMS

Using the Taylor series approximation method, &PSALMS computes estimates and
sampling errors for ratio means and totals for stratified clustered sample
designs. &PSALMS accesses both weighted and unweighted data, and does not
assume a simple random sample was taken. &PSALMS will optionally calculate
sampling errors for parameters on subclasses of the dataset.
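
For readers unfamiliar with the Taylor series method, the Python sketch below
applies the standard textbook linearization for a ratio mean r = y/x under a
stratified design with primary sampling units (PSUs) selected within strata.
It assumes at least two PSUs per stratum and ignores finite population
corrections. The function name and data are hypothetical, and the sketch is
not claimed to match the &PSALMS computations in detail.

```python
from math import sqrt


def ratio_mean_and_se(strata):
    """Return (r, standard error of r) for the ratio mean r = y / x.

    strata is a list of strata; each stratum is a list of (y, x) pairs giving
    the weighted totals of y and x for one PSU.
    """
    y_total = sum(y for stratum in strata for y, x in stratum)
    x_total = sum(x for stratum in strata for y, x in stratum)
    r = y_total / x_total
    var_z = 0.0
    for stratum in strata:
        a = len(stratum)  # number of PSUs in the stratum; must be at least 2
        # Linearized variable: z = y - r * x for each PSU.
        z = [y - r * x for y, x in stratum]
        z_bar = sum(z) / a
        var_z += a / (a - 1) * sum((zi - z_bar) ** 2 for zi in z)
    return r, sqrt(var_z) / x_total


strata = [
    [(20, 10), (30, 12)],  # stratum 1: two PSUs
    [(15, 8), (25, 10)],   # stratum 2: two PSUs
]
r, se = ratio_mean_and_se(strata)
print(r, se)
```

Because the variance is built from PSU totals within strata, the estimate
reflects clustering and stratification rather than treating the elements as a
simple random sample.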

&REPERR 

&REPERR computes estimates of regression statistics and their estimated
sampling errors for data from clustered sample designs using repeated
replication techniques. Replications are created using one of three methods:
balanced half-sample, jackknife, or user-specified replications.
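
One of the replication methods named above, the jackknife, can be sketched in
Python for a simple regression slope. The sketch uses the common
delete-one-PSU form: each replicate drops one PSU, reweights the remaining
PSUs of that stratum by a/(a-1), and recomputes the statistic. This form, the
function names, and the data are assumptions made for illustration and are
not drawn from the &REPERR documentation.

```python
def weighted_slope(points):
    """Least-squares slope of y on x for (x, y, weight) triples."""
    w = sum(p[2] for p in points)
    mx = sum(p[0] * p[2] for p in points) / w
    my = sum(p[1] * p[2] for p in points) / w
    sxy = sum(p[2] * (p[0] - mx) * (p[1] - my) for p in points)
    sxx = sum(p[2] * (p[0] - mx) ** 2 for p in points)
    return sxy / sxx


def jackknife_se(strata, statistic):
    """Return (full-sample estimate, jackknife standard error).

    strata is a list of strata; each stratum is a list of PSUs; each PSU is a
    list of (x, y) observations. Every stratum needs at least two PSUs.
    """
    def pooled(drop=None):
        points = []
        for h, stratum in enumerate(strata):
            a = len(stratum)
            for j, psu in enumerate(stratum):
                if drop == (h, j):
                    continue
                # Reweight the remaining PSUs of the stratum that lost one.
                wt = a / (a - 1) if drop is not None and drop[0] == h else 1.0
                points.extend((x, y, wt) for x, y in psu)
        return points

    full = statistic(pooled())
    variance = 0.0
    for h, stratum in enumerate(strata):
        a = len(stratum)
        reps = [statistic(pooled((h, j))) for j in range(a)]
        variance += (a - 1) / a * sum((rep - full) ** 2 for rep in reps)
    return full, variance ** 0.5


strata = [
    [[(1, 2.1), (2, 3.9)], [(3, 6.2), (4, 8.1)]],
    [[(5, 9.8), (6, 12.3)], [(7, 13.9), (8, 16.2)]],
]
slope, slope_se = jackknife_se(strata, weighted_slope)
print(slope, slope_se)
```

Any statistic that can be recomputed on a reweighted subsample can be passed
in place of weighted_slope, which is the main appeal of replication methods
for regression statistics.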

Supplemental Hierarchical Data Structure Support, developed by the Center for
Political Studies Computer Support Group

&MERGE

&MERGE modifies an OSIRIS IV hierarchical dataset. &MERGE will correct
variables, add new variables, add, delete, replace, or list occurrences
(records), and selectively join two structured datasets together.

&SORTFLD

&SORTFLD provides specific information about the sort fields in an OSIRIS IV
structured dataset. This information includes a description of the
structure, a display of the sort fields, and an analysis of the occurrence of
the logical pairs of groups in the data.

&STRANS

&STRANS will subset a structured dataset, keeping the structure intact. It
is especially useful for creating a random subset of a very large dataset.
&STRANS also permits new and derived measures to be added to a structure.

MAJOR DIFFERENCES BETWEEN OSIRIS IV AND OSIRIS III



1. Data Files

a) Hierarchical files with variable-length records may be created and used
in OSIRIS IV. Such files can save space and execution time, and add
flexibility for large-scale research data bases.

b) Virtually any kind of storage mode may be used for the data, including
character, integer and floating-point binary, packed and zoned decimal.

c) Leading blanks and decimal points are permissible in the data.

2. Dictionary Files

Dictionary files are "type 5" in OSIRIS IV by default (required for
hierarchical files); however, OSIRIS III dictionaries may be used without
change in OSIRIS IV.

3. Codebook Records

a) "L" cards can be used to provide category labels for use in OSIRIS IV.

b) OSIRIS III codebook records may be brought in and out of OSIRIS IV.
However, OSIRIS III codebook records with text fields longer than 56
characters will cause additional records to be generated, in order to make
room for the group number and expanded variable numbers required in OSIRIS
IV. These additional records will be collapsed back into their original form
if the dictionary is later converted back to OSIRIS III format with the &DICT
command.

4. Matrix Files

OSIRIS IV matrix files such as those created by &MDC and &FACTAN are
automatically available to subsequent commands such as &REGRESSN and &FACTAN.

5. RECODE

a) &RECODE statements must appear before the command which will use them.

b) Any mode recode may be used with any command. Decimal data will be
rounded when integer mode RECODE is used.

c) Alphabetic recoding is possible.

6. "INTEGER" Programs

The distinction between "INTEGER" and "FLOATING-POINT" programs has been
dropped. Commands for which only integer values are appropriate will
automatically round decimal data to the nearest integer as needed. &RECODE
could be used to scale such values to simulate OSIRIS III if desired.

7. Global Filters

- Numeric and alphabetic variables may be used.
- Leading zeroes do not have to be punched.
- Decimal points must be used as indicated by the dictionary.
- Filter statements can be of any character length.
- Parentheses may be used.
- The symbols > and < may be used.

8. Printout

- Page titles may be up to 100 characters long.
- Output is page-numbered and dated.
- Printout is more compact.

9. REPETITIONS

The use of REPETITION factors, which allow several analyses to be performed
for different subsets of the data, has been greatly expanded and provides a
facility analogous to "packets" in OSIRIS III.



NOTE: OSIRIS III remains available to users, as a software package with a
variety of features not incorporated in OSIRIS IV. OSIRIS III is, however, a
stabilized system with no further development currently underway.

DISTRIBUTION POLICY



OSIRIS IV is distributed on a not-for-profit basis by the Survey Research
Center Computer Support Group (SRCCSG) at the Institute for Social Research,
University of Michigan. A first-year fee and a subsequent yearly rental fee,
specified in the order form, are charged to cover distribution, maintenance,
and development costs. At the end of the first year, SRCCSG will
automatically bill for the next 12 month period and continue to do so until
an invoice is returned with a note asserting that OSIRIS IV is no longer in
use.

I. Materials sent 

Upon receipt of the order form and signed distribution agreement, complete 
with payment, SRCCSC will send the following materials to the requestor: 

1. OSIRIS IV, including source modules, load modules which are intended to
run under OS, MVS, or their equivalent on an IBM/360 or IBM/370, and
installation implementation instructions, all on magnetic tape.

2. One (1) copy of the OSIRIS IV manual.

3. One (1) copy of the OSIRIS IV subroutine manual.

4. One (1) copy of OHDS: An Introduction to the OSIRIS Hierarchical Data
Structures Capabilities.

II. Maintenance and Consultation 

1. Enhancements, updates, and improvements to OSIRIS IV, including new
releases, when and if developed for public distribution, will be sent
automatically to all current OSIRIS IV installations.

2. Considerable effort has been made to make OSIRIS IV as trouble free to
use and implement as possible, but should difficulties arise, a reasonable
amount of free consultation by phone or letter will be available, with all
phone charges to be paid by requestor.

3. Extended consultation on the use of OSIRIS IV may be available at
$25.00/hour or by special arrangement with SRCCSG.

4. Maintenance and consultation will be provided only for the current
release of OSIRIS IV, and may, but need not, be provided if the OSIRIS
IV installation has modified or changed OSIRIS IV.

5. In the event of the loss or destruction of the installation's copy of
the current OSIRIS IV, SRCCSG will replace it at a reasonable charge.

6. If serious errors are discovered, a revised tape will be sent out.

III. Distribution Agreement (repeated on back of order form) 

1. The OSIRIS IV system is to be used by the requesting installation only
on a single COMPUTER SYSTEM and for its own use, except as noted in
section 3 below. The term single COMPUTER SYSTEM encompasses a
multi-processor system wherein the processors are located on the same site, a
system wherein terminals are located off site, or other back-up system
located on the same site. Re-distribution of OSIRIS IV in whole or in
part or any derivative thereof, externally or internally, to other
computer systems or sites is prohibited.

2. No title or ownership rights to OSIRIS IV are transferred by this
agreement.

3. If a commercial installation makes computer time which uses OSIRIS IV
available to any other user, then the requesting installation agrees
that it will pay SRCCSG, within thirty (30) days after the end of each
calendar quarter, a 10% royalty on all charges to such other users made
by the installation for machine related services for each computer job
which utilizes OSIRIS IV.

4. All payments are exclusive of any tariffs, duties or taxes imposed or
levied by any government or governmental agency. The requesting
installation shall be liable for payment of all such taxes however
designated, levied, or based on OSIRIS IV, its use, or on this agreement,
including without limitation, state or local sales, use, and personal
property taxes.

6. While OSIRIS IV has been carefully developed and tested for accuracy and 
proper functioning, SRCCSG, the Survey Research Center, the Institute 
for Social Research, or the University of Michigan cannot guarantee the 
accuracy or correctness of OSIRIS IV. 

7. In no event shall the SRCCSG, the Survey Research Center, the Institute
for Social Research, or the University of Michigan become liable to the
requesting installation, or any other party, for any loss or damages,
consequential or otherwise, including but not limited to time, money, or
goodwill, arising from the use, operation or modification of OSIRIS IV
by the requesting installation.

OSIRIS IV Distribution
Computer Support Group
Survey Research Center
Institute for Social Research
University of Michigan
Ann Arbor, Michigan 48106

OSIRIS IV ORDER FORM: Complete BOTH sides and return to above address.

Shipping Address:

Name ______________________________
Firm/Institution __________________
Street ____________________________
City ______________  State ________  Zip ________
Country ___________________________
Telephone No. _____________________

This request must be accompanied by a check or purchase order. The current
rates available are as follows:

                                                       first    annual
                                                       year     renewal

basic fee                                              $2400    $1800
for government agencies and non-profit institutions    $1600    $1200
for institutions granting academic degrees             $1200    $ 900
for Inter-university Consortium for Political and
  Social Research (ICPSR) members                      $ 900    $ 675

Overseas orders: add $25

[ ] Check No. __________   [ ] Purchase Order No. __________

Amount $ __________

OSIRIS IV TAPE - indicate desired density:

9 track EBCDIC, odd parity tape written at
[ ] 1600 BPI   [ ] 6250 BPI

CHECK ONE CATEGORY:                   HARDWARE:

[ ] Degree Granting Institution       Manufacturer ______________
[ ] Government or Non-profit          Model No. _________________
[ ] Service Bureau                    Memory size _______________
[ ] Other: ____________               Operating System __________

FOR SRCCSG USE ONLY: Date Received    Version    Date Copied    Date Mailed

DISTRIBUTION AGREEMENT



1. The OSIRIS IV system is to be used by the requesting installation only on a
single COMPUTER SYSTEM and for its own use, except as noted in section 3
below. The term single COMPUTER SYSTEM encompasses a multi-processor system
wherein the processors are located on the same site, a system wherein
terminals are located off site, or other back-up system located on the same
site. Re-distribution of OSIRIS IV in whole or in part or any derivative
thereof, externally or internally, to other computer systems or sites is
prohibited.

2. No title or ownership rights to OSIRIS IV are transferred by this agreement.

3. If a commercial installation makes computer time which uses OSIRIS IV
available to any other user, then the requesting installation agrees that it
will pay SRCCSG, within thirty (30) days after the end of each calendar
quarter, a 10% royalty on all charges to such other users made by the
installation for machine related services for each computer job which
utilizes OSIRIS IV.

4. All payments are exclusive of any tariffs, duties or taxes imposed or levied
by any government or governmental agency. The requesting installation shall
be liable for payment of all such taxes however designated, levied, or based
on OSIRIS IV, its use, or on this agreement, including without limitation,
state or local sales, use, and personal property taxes.

6. While OSIRIS IV has been carefully developed and tested for accuracy and
proper functioning, SRCCSG, the Survey Research Center, the Institute for
Social Research, or the University of Michigan cannot guarantee the accuracy
or correctness of OSIRIS IV.

7. In no event shall the SRCCSG, the Survey Research Center, the Institute for
Social Research, or the University of Michigan become liable to the
requesting installation, or any other party, for any loss or damages,
consequential or otherwise, including but not limited to time, money, or
goodwill, arising from the use, operation or modification of OSIRIS IV by
the requesting installation.



The terms of this agreement are understood and accepted for the requesting
installation by:

Name ____________________        Title ____________________

Signature _______________        Date _____________________

OSIRIS IV AND RELATED DOCUMENTATION

Items listed below may be ordered using the attached order form.

OSIRIS IV: Statistical Analysis and Data Management Software System

A thorough description of each command and the overall system. The write-up
for each command includes a general description, uses, functional relations
to other commands, extended explanations of options and features,
restrictions, input and output requirements, and sample setups. 1979. 250
pages, ring binder. Price: $15.00 ($9.00 for over-the-counter cash purchase
from ISR supplies).

OSIRIS IV Subroutine Manual

The subroutines in the OSIRIS IV library are described in this manual. The
functional characteristics are detailed as well as all entry points and
calling sequences. This manual is useful to those wishing to modify existing
OSIRIS IV commands or to add new commands. 1979. 160 pages, ring binder.
Price: $13.00.

OSIRIS III, Vol. 5: Formulas and Statistical References, by Laura Klem.

Although an OSIRIS III reference, this volume is reasonably applicable to the
corresponding OSIRIS IV commands. An OSIRIS IV version is planned. 1974. 212
pages, loose-leaf, shrink wrapped. Price: $8.00.

Searching for Structure, by John A. Sonquist, Elizabeth Lauh Baker, and James
N. Morgan.

This monograph presents an approach to analysis of substantial bodies of
microdata which is incorporated in the OSIRIS III program AID3. The OSIRIS IV
&SEARCH command is a direct descendant of the AID3 program and several new
features have been added. A new monograph is in progress, but is not expected
to be available until late 1980. Revised edition, 1974. Price: $6.50
paperbound, $10.00 clothbound.

THAID: A Sequential Analysis Program for the Analysis of Nominal Scale
Dependent Variables, by James N. Morgan and Robert C. Messenger.

This monograph describes a technique for conducting multivariate analysis of
categorical dependent variables. Although common in social research, such
variables have, until recently, been difficult to handle with available
statistical techniques. THAID describes a searching process which provides an
efficient and effective means for sorting through a variety of analytic
models to find the one most able to produce useful predictions. The technique
calls for subgroups that differ maximally as to their distribution; it
assumes neither additivity nor linearity, and so requires substantial samples
of 1,000 or more cases. The OSIRIS IV command &SEARCH incorporates a modified
version of this technique as an option. 1973. 98 pages. Price: $8.00
clothbound.

Multiple Classification Analysis: A Report on a Computer Program for Multiple
Regression Using Categorical Predictors, by Frank Andrews, James N. Morgan,
John A. Sonquist, and Laura Klem.

Multiple Classification Analysis is a technique for examining the
interrelationships between several predictor variables and a dependent
variable within the context of an additive model. The OSIRIS IV &MCA command
implements this technique, and is a direct descendant of the program
described in this monograph. Revised edition, 1974. 105 pages. Price: $5.50
paperbound, $9.00 clothbound.

Multivariate Nominal Scale Analysis: A Report on a New Analysis Technique, by
Frank M. Andrews and Robert C. Messenger.

This monograph describes a powerful additive technique for conducting
multivariate analyses of categorical dependent variables. It is particularly
useful for exploring the interrelationships among theoretical concepts tapped
by one categorical dependent variable and substantial numbers of categorical
independent variables. This technique, already successfully incorporated into
the OSIRIS III program MNA, will be added soon to OSIRIS IV under the same
name. 1973. 114 pages. Price: $5.00 paperbound, $8.00 clothbound.

Data Processing in the Social Sciences with OSIRIS, by Judith Rattenbury and
Paula Pelletier.

This monograph is intended to guide researchers in the field of social science
(or their assistants) through all the stages necessary for processing data
with a computer. It introduces the basic components of computers and the
different kinds of software necessary for using a computer and then discusses
types of data and some of the preliminary data collection phases prior to
computer processing. The monograph goes step by step through the data
processing stages which must be accomplished before analysis can be
undertaken. It outlines different kinds of analysis and describes the kinds
of errors commonly made when using a computer for data processing, and gives
some hints on how to avoid them. Although the examples used are designed for
use with OSIRIS III, they may readily be translated for use with OSIRIS IV or
another system. 1974. 245 pages. Price: $6.00 paperbound, $10.00 clothbound.

A Guide for Selecting Statistical Techniques for Analyzing Social Science
Data, by Frank M. Andrews, Laura Klem, Terrence N. Davidson, Patrick M.
O'Malley, and Willard L. Rodgers.

The Guide is intended to be useful to social scientists, data analysts, and
graduate students who already have some knowledge of social science
statistics. It presents a systematic but highly condensed overview of over
100 currently used statistics and statistical techniques and their uses. The
core of the Guide, a decision tree, consists of 16 pages of sequential
questions and answers which lead the user to the appropriate technique. 1974,
third printing 1976. 38 pages. Price: $3.00 paperbound, five copies for
$10.00 ($2.00 for over-the-counter cash purchase from ISR supplies).



The following documentation may be ordered from: 

CENTER FOR POLITICAL STUDIES 
Post Office Box 1246 
Ann Arbor, Michigan 48106 

OHDS: An Introduction to the OSIRIS Hierarchical Data Structures Capabilities. 

This manual offers a step-by-step presentation on the generation, modification, 
and use of hierarchical data structures. It has many examples and diagrams, and 
is an important aid for understanding the new and powerful OSIRIS IV 
Hierarchical Data Structures. 1979. 100 pages. Price: $5.00. 
Item 61 11:58 Apr09/81 31 lines 
Neal Van Eck 

OSIRIS IV Manual, programs. 

A new OSIRIS IV manual is now available from ISR Supplies. It reflects 
changes which have been made since the June 1980 update to the sixth 
edition. The manual has been completely reviewed and changes have 
been made to clarify usage where needed. In particular, the &UPDATE 
write-up has been completely revised and includes more examples, per 
user suggestion. Substantial changes have also been made to the 
FUNDAMENTALS section, and a new section on structured files has been 
added to the DATA MANAGEMENT chapter. The &RECODE section has been 
completely restructured for ease of use. New programs which have been 
added are: 



&AGGREG     Aggregates individual records across subsets defined by 
            the user and computes summary statistics. (Available 
            about May 1) 

&CAP        Spatial configuration analysis. 

&CBLIST     Prints OSIRIS dictionary-codebooks in a sophisticated 
            format. 

&FREE       Permits data to be read into OSIRIS IV in a format-free 
            manner. 

&SASFILE    Reads a SAS internal file. 

&SPSSFILE   Reads an SPSS internal file. 

Other additions include an expanded description of the &SET command, a 
new storage type (half-byte integer binary; see &TRANS), a new option 
(one deck) for &MERCHECK, and a complete description of the "global 
delimiter" option (in the FUNDAMENTALS section). 